Section 19: Machine Learning

Lecture 88: Introduction to Machine Learning with R

ISLR is recommended. http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Sixth%20Printing.pdf

Machine learning is a method of data analysis that automates analytical model building. Using algorithms that iteratively learn from data, machine learning allows computers to find hidden insights without being explicitly programmed where to look.

  • fraud detection
  • web search results
  • real-time ads on web pages
  • credit scoring and next-best offers
  • prediction of equipment failures
  • new pricing models
  • network intrusion detection
  • recommendation engines
  • customer segmentation
  • text sentiment analysis
  • predicting customer churn
  • pattern and image recognition
  • email spam filtering
  • financial modeling

The general machine learning process is:

  1. data acquisition
  2. data cleaning
  3. model training & building
  4. model testing
  5. model deployment
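
The five steps above can be sketched in a few lines of R. This is an illustration only, using the built-in mtcars data set (not part of the lecture); the "deployment" step is just indicated in a comment.

```r
# illustrative sketch of the five-step process, using built-in data
set.seed(42)
df <- na.omit(mtcars)                           # 1-2. data acquisition & cleaning
idx <- sample(nrow(df), floor(0.7 * nrow(df)))  # 70/30 train/test split
model <- lm(mpg ~ wt, data = df[idx, ])         # 3. model training & building
preds <- predict(model, newdata = df[-idx, ])   # 4. model testing
# 5. model deployment, e.g. saveRDS(model, "model.rds")
length(preds)                                   # one prediction per test row
```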

Supervised learning

Supervised learning models are trained using labeled examples, i.e., inputs where the desired output is known. The learning algorithm receives a set of inputs along with the corresponding correct outputs, and it learns by comparing its actual output with the correct outputs to find errors. It then modifies the model accordingly. Common techniques include:

  • classification
  • regression
  • prediction
  • gradient boosting

Supervised learning is commonly used in applications where historical data predicts likely future events.

Unsupervised learning

Unsupervised learning is used against data that has no historical labels. The system is not told the “right answer”. The algorithm must figure out what is being shown. The goal is to explore the data and find some structure within.

Unsupervised learning can find the main attributes that separate customer segments from each other. Popular techniques include self-organizing maps, nearest-neighbor mapping, k-means clustering and singular value decomposition.
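
As a quick illustration of one of these techniques, k-means can recover two well-separated groups without ever seeing labels. The data below is synthetic (an assumption for illustration, not from the lecture):

```r
set.seed(1)
# 40 unlabeled points: two well-separated groups centered at 0 and 10
x <- matrix(c(rnorm(20, mean = 0), rnorm(20, mean = 10)), ncol = 1)
km <- kmeans(x, centers = 2)
table(km$cluster)  # the algorithm recovers the two groups of 20 points each
```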

Reinforcement learning

Reinforcement learning is often used for robotics, gaming and navigation. With reinforcement learning, the algorithm discovers through trial and error which actions yield the greatest rewards.

Section 20: Linear Regression

Lecture 88: History, Background and Overview

The linear model formula takes the form (y ~ x). To add more predictor variables, just use the + sign, e.g., (y ~ x1 + x2).
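
A minimal runnable sketch of the formula interface, using the built-in mtcars data (an illustration, not the lecture's data set):

```r
# response on the left of ~, predictors joined with +
fit <- lm(mpg ~ wt + hp, data = mtcars)  # built-in data set
coef(fit)          # intercept plus one coefficient per predictor
length(coef(fit))  # 3: intercept, wt, hp
```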

Example:

# model structure
model <- lm(log(PINCP, base=10) ~ AGEP + SEX + COW + SCHL, data=dtrain)
# predict
dtest$predLogPINCP <- predict(model, newdata=dtest)
dtrain$predLogPINCP <- predict(model, newdata=dtrain)

Lecture 89: Linear Regression with R (Part 1)

Get the data from: http://archive.ics.uci.edu/ml/datasets/Student+Performance

# student math data
df <- read.csv("student-mat.csv",sep=";")
class(df)
## [1] "data.frame"
names(df)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "failures"  
## [16] "schoolsup"  "famsup"     "paid"       "activities" "nursery"   
## [21] "higher"     "internet"   "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"
summary(df)
##  school   sex          age       address famsize   Pstatus      Medu      
##  GP:349   F:208   Min.   :15.0   R: 88   GT3:281   A: 41   Min.   :0.000  
##  MS: 46   M:187   1st Qu.:16.0   U:307   LE3:114   T:354   1st Qu.:2.000  
##                   Median :17.0                             Median :3.000  
##                   Mean   :16.7                             Mean   :2.749  
##                   3rd Qu.:18.0                             3rd Qu.:4.000  
##                   Max.   :22.0                             Max.   :4.000  
##       Fedu             Mjob           Fjob            reason   
##  Min.   :0.000   at_home : 59   at_home : 20   course    :145  
##  1st Qu.:2.000   health  : 34   health  : 18   home      :109  
##  Median :2.000   other   :141   other   :217   other     : 36  
##  Mean   :2.522   services:103   services:111   reputation:105  
##  3rd Qu.:3.000   teacher : 58   teacher : 29                   
##  Max.   :4.000                                                 
##    guardian     traveltime      studytime        failures      schoolsup
##  father: 90   Min.   :1.000   Min.   :1.000   Min.   :0.0000   no :344  
##  mother:273   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.0000   yes: 51  
##  other : 32   Median :1.000   Median :2.000   Median :0.0000            
##               Mean   :1.448   Mean   :2.035   Mean   :0.3342            
##               3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:0.0000            
##               Max.   :4.000   Max.   :4.000   Max.   :3.0000            
##  famsup     paid     activities nursery   higher    internet  romantic 
##  no :153   no :214   no :194    no : 81   no : 20   no : 66   no :263  
##  yes:242   yes:181   yes:201    yes:314   yes:375   yes:329   yes:132  
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##      famrel         freetime         goout            Dalc      
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:4.000   1st Qu.:3.000   1st Qu.:2.000   1st Qu.:1.000  
##  Median :4.000   Median :3.000   Median :3.000   Median :1.000  
##  Mean   :3.944   Mean   :3.235   Mean   :3.109   Mean   :1.481  
##  3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:2.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##       Walc           health         absences            G1       
##  Min.   :1.000   Min.   :1.000   Min.   : 0.000   Min.   : 3.00  
##  1st Qu.:1.000   1st Qu.:3.000   1st Qu.: 0.000   1st Qu.: 8.00  
##  Median :2.000   Median :4.000   Median : 4.000   Median :11.00  
##  Mean   :2.291   Mean   :3.554   Mean   : 5.709   Mean   :10.91  
##  3rd Qu.:3.000   3rd Qu.:5.000   3rd Qu.: 8.000   3rd Qu.:13.00  
##  Max.   :5.000   Max.   :5.000   Max.   :75.000   Max.   :19.00  
##        G2              G3       
##  Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 9.00   1st Qu.: 8.00  
##  Median :11.00   Median :11.00  
##  Mean   :10.71   Mean   :10.42  
##  3rd Qu.:13.00   3rd Qu.:14.00  
##  Max.   :19.00   Max.   :20.00
# check: no NA values are stored in the data frame
any(is.na(df))
## [1] FALSE
str(df)
## 'data.frame':    395 obs. of  33 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...

We plot the data using ggplot2.

library(ggplot2)
library(ggthemes)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# data correlation detection
# num only (numerical columns only)
num.cols <- sapply(df,is.numeric)
# filter 
cor.data <- cor(df[,num.cols]) %>% print()
##                     age         Medu         Fedu   traveltime
## age         1.000000000 -0.163658419 -0.163438069  0.070640721
## Medu       -0.163658419  1.000000000  0.623455112 -0.171639305
## Fedu       -0.163438069  0.623455112  1.000000000 -0.158194054
## traveltime  0.070640721 -0.171639305 -0.158194054  1.000000000
## studytime  -0.004140037  0.064944137 -0.009174639 -0.100909119
## failures    0.243665377 -0.236679963 -0.250408444  0.092238746
## famrel      0.053940096 -0.003914458 -0.001369727 -0.016807986
## freetime    0.016434389  0.030890867 -0.012845528 -0.017024944
## goout       0.126963880  0.064094438  0.043104668  0.028539674
## Dalc        0.131124605  0.019834099  0.002386429  0.138325309
## Walc        0.117276052 -0.047123460 -0.012631018  0.134115752
## health     -0.062187369 -0.046877829  0.014741537  0.007500606
## absences    0.175230079  0.100284818  0.024472887 -0.012943775
## G1         -0.064081497  0.205340997  0.190269936 -0.093039992
## G2         -0.143474049  0.215527168  0.164893393 -0.153197963
## G3         -0.161579438  0.217147496  0.152456939 -0.117142053
##               studytime    failures       famrel    freetime        goout
## age        -0.004140037  0.24366538  0.053940096  0.01643439  0.126963880
## Medu        0.064944137 -0.23667996 -0.003914458  0.03089087  0.064094438
## Fedu       -0.009174639 -0.25040844 -0.001369727 -0.01284553  0.043104668
## traveltime -0.100909119  0.09223875 -0.016807986 -0.01702494  0.028539674
## studytime   1.000000000 -0.17356303  0.039730704 -0.14319841 -0.063903675
## failures   -0.173563031  1.00000000 -0.044336626  0.09198747  0.124560922
## famrel      0.039730704 -0.04433663  1.000000000  0.15070144  0.064568411
## freetime   -0.143198407  0.09198747  0.150701444  1.00000000  0.285018715
## goout      -0.063903675  0.12456092  0.064568411  0.28501871  1.000000000
## Dalc       -0.196019263  0.13604693 -0.077594357  0.20900085  0.266993848
## Walc       -0.253784731  0.14196203 -0.113397308  0.14782181  0.420385745
## health     -0.075615863  0.06582728  0.094055728  0.07573336 -0.009577254
## absences   -0.062700175  0.06372583 -0.044354095 -0.05807792  0.044302220
## G1          0.160611915 -0.35471761  0.022168316  0.01261293 -0.149103967
## G2          0.135879999 -0.35589563 -0.018281347 -0.01377714 -0.162250034
## G3          0.097819690 -0.36041494  0.051363429  0.01130724 -0.132791474
##                    Dalc        Walc       health    absences          G1
## age         0.131124605  0.11727605 -0.062187369  0.17523008 -0.06408150
## Medu        0.019834099 -0.04712346 -0.046877829  0.10028482  0.20534100
## Fedu        0.002386429 -0.01263102  0.014741537  0.02447289  0.19026994
## traveltime  0.138325309  0.13411575  0.007500606 -0.01294378 -0.09303999
## studytime  -0.196019263 -0.25378473 -0.075615863 -0.06270018  0.16061192
## failures    0.136046931  0.14196203  0.065827282  0.06372583 -0.35471761
## famrel     -0.077594357 -0.11339731  0.094055728 -0.04435409  0.02216832
## freetime    0.209000848  0.14782181  0.075733357 -0.05807792  0.01261293
## goout       0.266993848  0.42038575 -0.009577254  0.04430222 -0.14910397
## Dalc        1.000000000  0.64754423  0.077179582  0.11190803 -0.09415879
## Walc        0.647544230  1.00000000  0.092476317  0.13629110 -0.12617921
## health      0.077179582  0.09247632  1.000000000 -0.02993671 -0.07317207
## absences    0.111908026  0.13629110 -0.029936711  1.00000000 -0.03100290
## G1         -0.094158792 -0.12617921 -0.073172073 -0.03100290  1.00000000
## G2         -0.064120183 -0.08492735 -0.097719866 -0.03177670  0.85211807
## G3         -0.054660041 -0.05193932 -0.061334605  0.03424732  0.80146793
##                     G2          G3
## age        -0.14347405 -0.16157944
## Medu        0.21552717  0.21714750
## Fedu        0.16489339  0.15245694
## traveltime -0.15319796 -0.11714205
## studytime   0.13588000  0.09781969
## failures   -0.35589563 -0.36041494
## famrel     -0.01828135  0.05136343
## freetime   -0.01377714  0.01130724
## goout      -0.16225003 -0.13279147
## Dalc       -0.06412018 -0.05466004
## Walc       -0.08492735 -0.05193932
## health     -0.09771987 -0.06133460
## absences   -0.03177670  0.03424732
## G1          0.85211807  0.80146793
## G2          1.00000000  0.90486799
## G3          0.90486799  1.00000000
# visualize the correlations

# packages: corrgram, corrplot
library(corrgram)
## Warning: package 'corrgram' was built under R version 3.4.4
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.4.4
## corrplot 0.84 loaded
# corrplot - quickly visualize correlation
corrplot(cor.data, method='color')

# corrgram - pass into the data frame directly
corrgram(df)

# help("corrgram")

# ggplot visualize
ggplot(df, aes(x=G3))+geom_histogram(bins = 20, alpha=0.5, fill='blue')

Lecture 90: Linear Regression with R (Part 2)

Split the data into train and test sets. We use install.packages('caTools').

# caTools for data split

Math <- read.csv("student-mat.csv",sep=";")
head(Math)
##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher
## 2     GP   F  17       U     GT3       T    1    1  at_home    other
## 3     GP   F  15       U     LE3       T    1    1  at_home    other
## 4     GP   F  15       U     GT3       T    4    2   health services
## 5     GP   F  16       U     GT3       T    3    3    other    other
## 6     GP   M  16       U     LE3       T    4    3 services    other
##       reason guardian traveltime studytime failures schoolsup famsup paid
## 1     course   mother          2         2        0       yes     no   no
## 2     course   father          1         2        0        no    yes   no
## 3      other   mother          1         2        3       yes     no  yes
## 4       home   mother          1         3        0        no    yes  yes
## 5       home   father          1         2        0        no    yes  yes
## 6 reputation   mother          1         2        0        no    yes  yes
##   activities nursery higher internet romantic famrel freetime goout Dalc
## 1         no     yes    yes       no       no      4        3     4    1
## 2         no      no    yes      yes       no      5        3     3    1
## 3         no     yes    yes      yes       no      4        3     2    2
## 4        yes     yes    yes      yes      yes      3        2     2    1
## 5         no     yes    yes       no       no      4        3     2    1
## 6        yes     yes    yes      yes       no      5        4     2    1
##   Walc health absences G1 G2 G3
## 1    1      3        6  5  6  6
## 2    1      3        4  5  5  6
## 3    3      3       10  7  8 10
## 4    1      5        2 15 14 15
## 5    2      5        4  6 10 10
## 6    2      5       10 15 15 15
# install.packages('caTools')
library(caTools)
## Warning: package 'caTools' was built under R version 3.4.4
# set a seed
set.seed(101)

Approach (1): caTools package

sample <- sample.split(Math$G3, SplitRatio = 0.7)
train <- subset(Math, sample == TRUE)   # subset the same Math data frame
test <- subset(Math, sample == FALSE)   # note ==, not = (sample = FALSE returned all rows)

Approach (2): simple version using sample

# alternative - split of training/test set based on ISLR
library(magrittr)
train_set <- sample(395,197)
train_set
##   [1] 279 117  59 192  45 170 322  33 337 257 370  83 317 339  92 310 219
##  [18] 174  20 232 373 217 127 163  80 347 359   6 148 266 168 189 301 327
##  [35] 272 274 302 228 159 167 262 240 142 383 365 314 338 211 198 157 335
##  [52] 271  35 104  79 394  97 363 186 180 115 242 237 101 133 258 144 259
##  [69] 112 371 181 286 203 233  52 103 114 207 387  94 171 269 173 131  70
##  [86] 226  16  90  12 368 214 183 307 145 201  77  36  22 197 151 350 209
## [103]  71  62 130 118 175 105  84 358 200 342 346  34  43 281 188 312 341
## [120] 220 216  42  47 121 178 106 323  23   1 251 378  75 111 328 372 205
## [137] 193 150 126 386  21 158 116 282  18 195 134 122 270  46 210 160 213
## [154]  31  67 138 275 140  32 343 351 110 238  14 318  28 352   4 227 329
## [171] 321 184 225 392 291 309 154  85 176 265 113 298 264 326 212  76 382
## [188] 222  10 384 285 331 247 353 244 147  88
# note: head() has no `subset` argument (it was silently ignored, so head()
# printed the first rows of Math); index the sampled training rows directly
head(Math[train_set, ])
# negative indexing selects the held-out test rows
head(Math[-train_set, ])
# linear model fit - all variables
# the second call overwrites the first; the fit reported below uses the caTools split (data = train)
# lm.fit <- lm(G3 ~ ., data = Math, subset = train_set)   # ISLR-style alternative
lm.fit <- lm(G3 ~ ., data = train)

print(summary(lm.fit))
## 
## Call:
## lm(formula = G3 ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.4250 -0.6478  0.2844  1.0442  4.9840 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.70763    2.69488   1.376  0.17019    
## schoolMS          0.66981    0.47436   1.412  0.15926    
## sexM              0.25730    0.29257   0.879  0.38006    
## age              -0.36163    0.12949  -2.793  0.00566 ** 
## addressU          0.08123    0.35652   0.228  0.81996    
## famsizeLE3        0.12222    0.28709   0.426  0.67070    
## PstatusT          0.06807    0.43032   0.158  0.87444    
## Medu              0.11100    0.18757   0.592  0.55455    
## Fedu             -0.16373    0.15928  -1.028  0.30503    
## Mjobhealth       -0.63993    0.65314  -0.980  0.32820    
## Mjobother        -0.15730    0.42323  -0.372  0.71048    
## Mjobservices     -0.15872    0.46682  -0.340  0.73415    
## Mjobteacher      -0.04930    0.62335  -0.079  0.93702    
## Fjobhealth        0.17565    0.83034   0.212  0.83265    
## Fjobother        -0.29559    0.56012  -0.528  0.59818    
## Fjobservices     -0.76964    0.59476  -1.294  0.19692    
## Fjobteacher      -0.27009    0.73824  -0.366  0.71480    
## reasonhome       -0.41126    0.31857  -1.291  0.19799    
## reasonother       0.06767    0.45323   0.149  0.88144    
## reasonreputation  0.13478    0.34735   0.388  0.69834    
## guardianmother   -0.05442    0.31663  -0.172  0.86369    
## guardianother     0.01588    0.58375   0.027  0.97832    
## traveltime       -0.02353    0.19540  -0.120  0.90427    
## studytime        -0.04294    0.16910  -0.254  0.79979    
## failures         -0.17219    0.19668  -0.875  0.38220    
## schoolsupyes      0.20742    0.42358   0.490  0.62481    
## famsupyes        -0.05329    0.27753  -0.192  0.84789    
## paidyes           0.31311    0.28284   1.107  0.26941    
## activitiesyes    -0.26104    0.26687  -0.978  0.32901    
## nurseryyes       -0.05345    0.31236  -0.171  0.86428    
## higheryes        -0.94298    0.74005  -1.274  0.20385    
## internetyes      -0.15834    0.37029  -0.428  0.66932    
## romanticyes      -0.30048    0.28115  -1.069  0.28627    
## famrel            0.36601    0.14609   2.505  0.01291 *  
## freetime          0.08386    0.14247   0.589  0.55668    
## goout            -0.12457    0.13306  -0.936  0.35015    
## Dalc             -0.16995    0.20659  -0.823  0.41153    
## Walc              0.21053    0.14963   1.407  0.16074    
## health            0.07805    0.09341   0.836  0.40426    
## absences          0.09547    0.02382   4.008 8.24e-05 ***
## G1                0.14259    0.07892   1.807  0.07206 .  
## G2                0.98859    0.06929  14.267  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.962 on 235 degrees of freedom
## Multiple R-squared:  0.8456, Adjusted R-squared:  0.8187 
## F-statistic: 31.39 on 41 and 235 DF,  p-value: < 2.2e-16
# coefficient plot
library(coefplot)
coefplot(lm.fit)

# residual value extract
res <- residuals(lm.fit)
class(res)
## [1] "numeric"
res <- as.data.frame(res)
head(res)
##           res
## 1   1.4684389
## 2   1.8826707
## 3   1.1866990
## 6  -2.2440152
## 9   0.5974865
## 11  0.8583062
# plot residual
plot(lm.fit$residuals)

ggplot(res,aes(x=res))+geom_histogram(fill="light blue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Lecture 91: Linear Regression with R (Part 3)

Advanced visualizations can be applied for model interpretation:

  1. Residuals vs Fitted
  2. Normal Q-Q
  3. Scale-Location
  4. Residuals vs Leverage

plot(lm.fit)

The model is used for prediction using the following code.

# Prediction
pred <- predict(lm.fit,test)

# column bind
results <- cbind(pred,test$G3)
results <- as.data.frame(results)
colnames(results) <- c('predicted','actual')
head(results)
##   predicted actual
## 1  4.531561      6
## 2  4.117329      6
## 3  8.813301     10
## 4 12.682507     15
## 5  9.433677     10
## 6 17.244015     15
plot(results$predicted,results$actual)

# take care of negative values
to_zero <- function(x){
  if (x<0){
    return(0)
  }else{
    return(x)
    }
  }
# apply zero function
results$predicted <- sapply(results$predicted, to_zero)
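
A vectorized base-R alternative to the to_zero() helper is pmax(), which clips negative predictions in a single call (the example vector is made up for illustration):

```r
preds <- c(-1.2, 0, 3.5)
pmax(preds, 0)  # element-wise maximum with 0: negatives become 0
```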

# mean squared error
MSE <- mean ((results$actual-results$predicted)^2) %>% print()
## [1] 3.523079
MSE
## [1] 3.523079
# RMSE
print(MSE^0.5)
## [1] 1.876987
SSE <- sum((results$predicted - results$actual)^2)
# note: the parenthesis placement matters - mean(Math$G3 - results$actual) gave SST = 0 and R2 = -Inf
SST <- sum((mean(Math$G3) - results$actual)^2)
R2 <- 1 - SSE/SST
R2
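
The error metrics used above can be checked by hand on a tiny example (the numbers here are made up for illustration):

```r
actual    <- c(3, 5, 7)
predicted <- c(2, 5, 9)
mse  <- mean((actual - predicted)^2)    # (1 + 0 + 4)/3
rmse <- sqrt(mse)
sse  <- sum((predicted - actual)^2)     # 5
sst  <- sum((actual - mean(actual))^2)  # 4 + 0 + 4 = 8
r2   <- 1 - sse/sst                     # 1 - 5/8 = 0.375
```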

Lecture 92: Linear Regression Project

We first explore the data and then conduct a linear regression.

Get the data

Download the data or just use the supplied csv in the repository. The data has the following features:

  • datetime: hourly date + timestamp
  • season: 1 = spring, 2 = summer, 3 = fall, 4 = winter
  • holiday: whether the day is considered a holiday
  • workingday: whether the day is neither a weekend nor a holiday
  • weather:
      1: Clear, Few clouds, Partly cloudy
      2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
      3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
      4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
  • temp: temperature in Celsius
  • atemp: “feels like” temperature in Celsius
  • humidity: relative humidity
  • windspeed: wind speed
  • casual: number of non-registered user rentals initiated
  • registered: number of registered user rentals initiated
  • count: number of total rentals

bike <- read.csv('bikeshare.csv')
head(bike)
##              datetime season holiday workingday weather temp  atemp
## 1 2011-01-01 00:00:00      1       0          0       1 9.84 14.395
## 2 2011-01-01 01:00:00      1       0          0       1 9.02 13.635
## 3 2011-01-01 02:00:00      1       0          0       1 9.02 13.635
## 4 2011-01-01 03:00:00      1       0          0       1 9.84 14.395
## 5 2011-01-01 04:00:00      1       0          0       1 9.84 14.395
## 6 2011-01-01 05:00:00      1       0          0       2 9.84 12.880
##   humidity windspeed casual registered count
## 1       81    0.0000      3         13    16
## 2       80    0.0000      8         32    40
## 3       80    0.0000      5         27    32
## 4       75    0.0000      3         10    13
## 5       75    0.0000      0          1     1
## 6       75    6.0032      0          1     1

The purpose of our analysis is to predict the count variable in the data set.

Exploratory Data Analysis

First we create a scatter plot of count vs temp.

  • count: number of total rentals
  • temp: temperature in Celsius

# ggplot visualization
library(ggplot2)

ggplot(data=bike,aes(x=temp,y=count))+geom_point(aes(color=temp,alpha=0.1))+geom_smooth()+theme_bw()
## `geom_smooth()` using method = 'gam'

Next, we plot count versus datetime as a scatterplot with a color gradient based on temperature. You’ll need to convert the datetime column into POSIXct before plotting.

library(ggplot2)
ggplot(bike,aes(x=datetime,y=count))+geom_point(aes(color=temp,alpha=0.2))+scale_color_continuous(low="blue",high="orange")+theme_bw()

# reference 
# ggplot(bike,aes(datetime,count)) + geom_point(aes(color=temp),alpha=0.5)  + scale_color_continuous(low='#55D8CE',high='#FF6E2E') +theme_bw()

Hopefully, we notice two things: (1) A seasonality to the data for winter and summer. (2) Bike rental counts are increasing in general.

The above might present a problem with using a linear regression model if the data is non-linear. Let’s have a quick overview of pros/cons of linear regression.

Pros:

  • Simple to explain
  • Highly interpretable
  • Model training and prediction are fast
  • No tuning is required (excluding regularization)
  • Features don’t need scaling
  • Can perform well with a small number of observations
  • Well-understood

Cons:

  • Assumes a linear relationship between the features and the response
  • Performance is (generally) not competitive with the best supervised learning methods due to high bias
  • Can’t automatically learn feature interactions

We’ll keep this in mind as we continue on. Maybe when we learn more algorithms, we can come back to this with some new tools; for now we’ll stick to linear regression.

Next, we will calculate correlation between temp and count.

cor.data <- cor(bike[,c("temp","count")])
cor.data
##            temp     count
## temp  1.0000000 0.3944536
## count 0.3944536 1.0000000
library(corrgram)
library(corrplot)

# corrplot - quickly visualize correlation
corrplot(cor.data, method='color')

# corrgram - pass into the data frame directly
corrgram(bike)

# help("corrgram")

Let’s explore the season data. Create a boxplot, with the y axis indicating count and the x axis being a box for each season.

ggplot(bike,aes(x=factor(season),y=count))+geom_boxplot(aes(colour=factor(season)))

This visualization shows that a line cannot capture the non-linear relationship. Plus, there appear to be more rentals in winter than in spring. We know these effects come from the overall growth in rental counts over time, not from the actual season.

Feature engineering

A lot of times you’ll need to use domain knowledge and experience to engineer and create new features. Let’s go ahead and engineer some new features from the datetime column.

First, we create an hour column that takes the hour from the datetime column. You’ll probably need to apply some function to the entire datetime column and reassign it.

Hint: time.stamp <- bike$datetime[4]; format(time.stamp, "%H")

library(magrittr)
time.stamp <- bike$datetime[4] %>% as.Date() %>% print()
## [1] "2011-01-01"
# note: as.Date() drops the time of day, so %H always returns "00" - hence the POSIXct conversion below
format(time.stamp,"%H")
## [1] "00"
#as.POSIXct() function - CONVERT to POSIXct()
bike$datetime <- as.POSIXct(bike$datetime)

# sapply()
bike$hour <- sapply(bike$datetime,function(x){format(x,"%H")})
head(bike)
##              datetime season holiday workingday weather temp  atemp
## 1 2011-01-01 00:00:00      1       0          0       1 9.84 14.395
## 2 2011-01-01 01:00:00      1       0          0       1 9.02 13.635
## 3 2011-01-01 02:00:00      1       0          0       1 9.02 13.635
## 4 2011-01-01 03:00:00      1       0          0       1 9.84 14.395
## 5 2011-01-01 04:00:00      1       0          0       1 9.84 14.395
## 6 2011-01-01 05:00:00      1       0          0       2 9.84 12.880
##   humidity windspeed casual registered count hour
## 1       81    0.0000      3         13    16   00
## 2       80    0.0000      8         32    40   01
## 3       80    0.0000      5         27    32   02
## 4       75    0.0000      3         10    13   03
## 5       75    0.0000      0          1     1   04
## 6       75    6.0032      0          1     1   05

Now we create a scatterplot of count versus hour, with a color scale based on temp. Only use bike data where workingday==1. Optional additions:

  • use the additional layer scale_color_gradientn(colours=c('color1','color2'))
  • use position=position_jitter(w=1,h=0) inside of geom_point and check out what it does

library(dplyr)
library(ggplot2)

pl <- ggplot(filter(bike,workingday==1),aes(hour,count)) 
pl <- pl + geom_point(position=position_jitter(w=1, h=0),aes(color=temp),alpha=0.5)
pl <- pl + scale_color_gradientn(colours = c('dark blue','blue','light blue','light green','yellow','orange','red'))
pl + theme_bw()

Now, we create the same plot for non working days:

pl <- ggplot(filter(bike,workingday==0),aes(hour,count)) 
pl <- pl + geom_point(position=position_jitter(w=1, h=0),aes(color=temp),alpha=0.8)
pl <- pl + scale_color_gradientn(colours = c('dark blue','blue','light blue','light green','yellow','orange','red'))
pl + theme_bw()

Working days have peak activity during the morning commute (~8am) and right after work gets out (~5pm), with some lunchtime activity, while non-working days show a steady rise and fall over the afternoon. Now let’s continue by trying to build a model; we’ll begin by looking at a single feature.

Building a linear model

We use lm() to build a model that predicts count based solely on the temp feature, name it temp.model.

temp.model <- lm(count~temp,data=bike)
summary(temp.model)
## 
## Call:
## lm(formula = count ~ temp, data = bike)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -293.32 -112.36  -33.36   78.98  741.44 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.0462     4.4394   1.362    0.173    
## temp          9.1705     0.2048  44.783   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 166.5 on 10884 degrees of freedom
## Multiple R-squared:  0.1556, Adjusted R-squared:  0.1555 
## F-statistic:  2006 on 1 and 10884 DF,  p-value: < 2.2e-16

We got 6.0462 as the intercept and 9.17 as the temp coefficient.

Intercept

  • it is the value of y when x=0
  • Thus, it is the estimated number of rentals when the temperature is 0.
  • Note: it does not always make sense to interpret the intercept

Coefficient

  • It is the change in y divided by the change in x, i.e. the “slope”
  • Thus, a temperature increase of 1 degree is associated with a rental increase of 9.17 bikes.
  • This is not a statement of causation.
  • \[\beta_1\] would be negative if an increase in temperature were associated with a decrease in rentals.
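Rather than copying the numbers by hand, the estimates can be pulled straight from the fitted model object; a quick sketch (assuming temp.model has been fit as above):

```r
cfs <- coef(temp.model)   # named vector of estimates
cfs["(Intercept)"]        # estimated rentals at 0 degrees
cfs["temp"]               # change in rentals per 1 degree increase

confint(temp.model)       # 95% confidence intervals for both parameters
```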

Predicting with the model

The next question: how many bike rentals would we predict if the temperature were 25 degrees Celsius? We calculate this two ways: (1) use the coefficient values we got above; (2) use the predict() function. The two results differ slightly because the first uses rounded coefficients.

6.0462 + 9.17*25
## [1] 235.2962
predict(temp.model,data.frame(temp=c(25)))
##        1 
## 235.3097

We use sapply() with as.numeric() to change the hour column into a column of numeric values.

bike$hour <- sapply(bike$hour,as.numeric)

Finally, we build a model that attempts to predict count based off the following features. Figure out if there is a way not to pass/write all these variables into the lm() function by hand.

lm.model2 <- lm(count~.-casual-registered-datetime-atemp,data=bike)
summary(lm.model2)
## 
## Call:
## lm(formula = count ~ . - casual - registered - datetime - atemp, 
##     data = bike)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -324.61  -96.88  -31.01   55.27  688.83 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.91369    8.45147   5.551 2.91e-08 ***
## season       21.70333    1.35409  16.028  < 2e-16 ***
## holiday     -10.29914    8.79069  -1.172    0.241    
## workingday   -0.71781    3.14463  -0.228    0.819    
## weather      -3.20909    2.49731  -1.285    0.199    
## temp          7.01953    0.19135  36.684  < 2e-16 ***
## humidity     -2.21174    0.09083 -24.349  < 2e-16 ***
## windspeed     0.20271    0.18639   1.088    0.277    
## hour          7.61283    0.21688  35.102  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 147.8 on 10877 degrees of freedom
## Multiple R-squared:  0.3344, Adjusted R-squared:  0.3339 
## F-statistic:   683 on 8 and 10877 DF,  p-value: < 2.2e-16
library(coefplot)
coefplot(lm.model2)

A linear model like this one, which uses ordinary least squares (OLS), won’t be able to take into account the seasonality of the data: it will get thrown off by the growth in the dataset, accidentally attributing that growth to the winter season instead of recognizing it as overall demand increasing. Later on, we’ll see whether other models are a better fit for this sort of data.

Section 22: Logistic Regression

Lecture 95: Introduction

Logistic regression is used as a method for classification. Examples: - spam versus ham emails - loan default (yes/no) - disease diagnosis

Background

We have seen regression problems where we try to predict continuous values. Although the name may be confusing at first, logistic regression allows us to solve classification problems, where we try to predict discrete categories. The convention for binary classification is to have two classes, 0 and 1.

The sigmoid (aka logistic) function takes in any value and outputs a value between 0 and 1. The sigmoid function is defined as: \[ \phi(z) = \frac{1}{1+e^{-z}} \] This means we can take our linear regression solution and place it into the sigmoid function. We can set a cutoff point at 0.5: anything below it results in class 0, anything above in class 1.

We use the logistic function to output a value ranging from 0 to 1. Based off of this probability we assign a class.
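The cutoff logic can be sketched in a few lines of R (sigmoid and classify are our own helper names here, not from any package):

```r
# Sigmoid: maps any real value into (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Push the linear part beta0 + beta1*x through the sigmoid,
# then threshold the probability at the cutoff to assign a class
classify <- function(x, beta0, beta1, cutoff = 0.5) {
  p <- sigmoid(beta0 + beta1 * x)
  ifelse(p > cutoff, 1, 0)
}

sigmoid(0)                                   # 0.5, the decision boundary
classify(c(-2, 0, 2), beta0 = 0, beta1 = 1)  # 0 0 1
```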

Model evaluation

We can use a confusion matrix to evaluate our model. In disease-testing terms: - Type I error: false positive - Type II error: false negative
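Given a 2x2 confusion matrix, the usual evaluation metrics fall out directly. A minimal sketch with made-up counts:

```r
# Rows: actual class, columns: predicted class (counts are invented)
cm <- matrix(c(50, 10,    # actual 0: TN = 50, FP = 10 (Type I errors)
                5, 35),   # actual 1: FN = 5 (Type II errors), TP = 35
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c(0, 1), predicted = c(0, 1)))

TN <- cm[1, 1]; FP <- cm[1, 2]
FN <- cm[2, 1]; TP <- cm[2, 2]

accuracy  <- (TP + TN) / sum(cm)  # 0.85
precision <- TP / (TP + FP)       # of predicted positives, fraction correct
recall    <- TP / (TP + FN)       # of actual positives, fraction found
```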

Lecture 96: Logistic regression with R (Part1)

We use the Titanic data to apply a logistic regression model. - PassengerId - Pclass: passenger class - SibSp: number of siblings/spouses on board - Parch: number of parents/children on board - Fare: ticket fare - Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

titanic.train <- read.csv('titanic_train.csv')
head(titanic.train)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp
## 1                             Braund, Mr. Owen Harris   male  22     1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1
## 3                              Heikkinen, Miss. Laina female  26     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1
## 5                            Allen, Mr. William Henry   male  35     0
## 6                                    Moran, Mr. James   male  NA     0
##   Parch           Ticket    Fare Cabin Embarked
## 1     0        A/5 21171  7.2500              S
## 2     0         PC 17599 71.2833   C85        C
## 3     0 STON/O2. 3101282  7.9250              S
## 4     0           113803 53.1000  C123        S
## 5     0           373450  8.0500              S
## 6     0           330877  8.4583              Q
str(titanic.train)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
dim(titanic.train)
## [1] 891  12

By using the missmap() function from the Amelia package, we can see that the Age variable has many missing values.

# install.packages('Amelia')
# library(Amelia)
# missmap(titanic.train,main='Missing Map',col=c('yellow','black'),legend = FALSE)

With simple bar charts, we can see the data distribution. - third class > first class > second class - most passengers are in their 20s and 30s (with some children on board)

# Pclass distribution
ggplot(titanic.train,aes(x=Pclass))+geom_bar(aes(fill=factor(Pclass)))

# Sex distribution
ggplot(titanic.train,aes(x=Sex))+geom_bar(aes(fill=factor(Sex)))

#  age distribution
ggplot(titanic.train,aes(x=Age))+geom_histogram(bins=20,alpha=0.3,fill='dark blue')
## Warning: Removed 177 rows containing non-finite values (stat_bin).

# SibSp - no Siblings/Spouses
ggplot(titanic.train,aes(SibSp))+geom_bar()

ggplot(titanic.train,aes(x=Fare))+geom_histogram(fill="light blue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Options to deal with missing values for clean and tidy data are: (1) exclude all rows with missing values * For this case, 177 rows is too many to exclude. (2) Fill in missing values with an estimate, e.g., the average age by passenger class.

We will practice option 2.
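Before hard-coding imputation values, it helps to compute the class-wise averages numerically as well as visually; the values 37, 29 and 24 used below come from roughly this kind of calculation:

```r
# Mean and median Age by passenger class (rows with NA Age are dropped
# automatically by the formula interface)
aggregate(Age ~ Pclass, data = titanic.train, FUN = mean)
aggregate(Age ~ Pclass, data = titanic.train, FUN = median)
```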

pl <- ggplot(titanic.train,aes(x=Pclass,y=Age))
pl <- pl + geom_boxplot(aes(group=Pclass,fill=factor(Pclass),alpha=0.4))
pl
## Warning: Removed 177 rows containing non-finite values (stat_boxplot).

pl + scale_y_continuous(breaks = seq(min(0),max(80),by=2))
## Warning: Removed 177 rows containing non-finite values (stat_boxplot).

impute_age <- function(age,class){
    out <- age
    for (i in 1:length(age)){
        
        if (is.na(age[i])){

            if (class[i] == 1){
                out[i] <- 37

            }else if (class[i] == 2){
                out[i] <- 29

            }else{
                out[i] <- 24
            }
        }else{
            out[i]<-age[i]
        }
    }
    return(out)
}
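The explicit loop above works, but the same imputation can be written vectorized with ifelse(); this sketch should behave identically to impute_age:

```r
impute_age_vec <- function(age, class) {
  # Fill value depends on passenger class: 1st -> 37, 2nd -> 29, 3rd -> 24
  fill <- ifelse(class == 1, 37, ifelse(class == 2, 29, 24))
  # Keep observed ages, substitute the class average where age is NA
  ifelse(is.na(age), fill, age)
}

# Spot check
impute_age_vec(c(NA, 22, NA), c(1, 3, 2))  # 37 22 29
```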

Lecture 97: Logistic Regression with R (Part 2)

As continued from the previous lecture, we fill in average age values for the N/A rows in the training set. By doing this, we no longer have any rows with missing values.

fixed.ages <- impute_age(titanic.train$Age,titanic.train$Pclass)
titanic.train$Age <- fixed.ages

# missmap(titanic.train,main="Imputation Check") # N/A values were supplemented

Since several variables will be left out of the analysis, we exclude them from the dataset.

str(titanic.train)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 24 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
library(dplyr)
titanic.train <- select(titanic.train,-PassengerId,-Name,-Ticket,-Cabin)
str(titanic.train)
## 'data.frame':    891 obs. of  8 variables:
##  $ Survived: int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass  : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age     : num  22 38 26 35 35 24 54 2 27 14 ...
##  $ SibSp   : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch   : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
dim(titanic.train)
## [1] 891   8
titanic.train$Survived <- as.factor(titanic.train$Survived)
titanic.train$Pclass <- as.factor(titanic.train$Pclass)
titanic.train$Parch <- as.factor(titanic.train$Parch)
titanic.train$SibSp <- as.factor(titanic.train$SibSp)

str(titanic.train)
## 'data.frame':    891 obs. of  8 variables:
##  $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age     : num  22 38 26 35 35 24 54 2 27 14 ...
##  $ SibSp   : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
##  $ Parch   : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
##  $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
log.model <- glm(Survived~.,family=binomial(link='logit'),data=titanic.train)
summary(log.model)
## 
## Call:
## glm(formula = Survived ~ ., family = binomial(link = "logit"), 
##     data = titanic.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8158  -0.6134  -0.4138   0.5808   2.4896  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.845e+01  1.660e+03   0.011 0.991134    
## Pclass2     -1.079e+00  3.092e-01  -3.490 0.000484 ***
## Pclass3     -2.191e+00  3.161e-01  -6.930 4.20e-12 ***
## Sexmale     -2.677e+00  2.040e-01 -13.123  < 2e-16 ***
## Age         -3.971e-02  8.758e-03  -4.534 5.79e-06 ***
## SibSp1       8.135e-02  2.245e-01   0.362 0.717133    
## SibSp2      -2.897e-01  5.368e-01  -0.540 0.589361    
## SibSp3      -2.241e+00  7.202e-01  -3.111 0.001862 ** 
## SibSp4      -1.675e+00  7.620e-01  -2.198 0.027954 *  
## SibSp5      -1.595e+01  9.588e+02  -0.017 0.986731    
## SibSp8      -1.607e+01  7.578e+02  -0.021 0.983077    
## Parch1       3.741e-01  2.895e-01   1.292 0.196213    
## Parch2       3.862e-02  3.824e-01   0.101 0.919560    
## Parch3       3.655e-01  1.056e+00   0.346 0.729318    
## Parch4      -1.586e+01  1.055e+03  -0.015 0.988007    
## Parch5      -1.152e+00  1.172e+00  -0.983 0.325771    
## Parch6      -1.643e+01  2.400e+03  -0.007 0.994536    
## Fare         2.109e-03  2.490e-03   0.847 0.397036    
## EmbarkedC   -1.458e+01  1.660e+03  -0.009 0.992995    
## EmbarkedQ   -1.456e+01  1.660e+03  -0.009 0.993001    
## EmbarkedS   -1.486e+01  1.660e+03  -0.009 0.992857    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  763.41  on 870  degrees of freedom
## AIC: 805.41
## 
## Number of Fisher Scoring iterations: 15
library(coefplot)
coefplot(log.model)

Next, we split the training data into train and test subsets so we can evaluate predictions on held-out data.

library(caTools)
set.seed(101)
split <- sample.split(titanic.train$Survived,SplitRatio = 0.7)
final.train <- subset(titanic.train,split==TRUE)
final.test <- subset(titanic.train,split==FALSE)

final.log.model <- glm(Survived~.,family=binomial(link='logit'),data=final.train)
summary(final.log.model)
## 
## Call:
## glm(formula = Survived ~ ., family = binomial(link = "logit"), 
##     data = final.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8288  -0.5607  -0.4096   0.6174   2.4898  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.777e+01  2.400e+03   0.007 0.994091    
## Pclass2     -1.230e+00  3.814e-01  -3.225 0.001261 ** 
## Pclass3     -2.160e+00  3.841e-01  -5.624 1.87e-08 ***
## Sexmale     -2.660e+00  2.467e-01 -10.782  < 2e-16 ***
## Age         -3.831e-02  1.034e-02  -3.705 0.000212 ***
## SibSp1      -2.114e-02  2.755e-01  -0.077 0.938836    
## SibSp2      -4.000e-01  6.463e-01  -0.619 0.536028    
## SibSp3      -2.324e+00  8.994e-01  -2.584 0.009765 ** 
## SibSp4      -1.196e+00  8.302e-01  -1.440 0.149839    
## SibSp5      -1.603e+01  9.592e+02  -0.017 0.986666    
## SibSp8      -1.633e+01  1.004e+03  -0.016 0.987019    
## Parch1       7.290e-01  3.545e-01   2.056 0.039771 *  
## Parch2       1.406e-01  4.504e-01   0.312 0.754892    
## Parch3       7.919e-01  1.229e+00   0.645 0.519226    
## Parch4      -1.498e+01  1.552e+03  -0.010 0.992300    
## Parch5      -9.772e-03  1.378e+00  -0.007 0.994343    
## Parch6      -1.635e+01  2.400e+03  -0.007 0.994563    
## Fare         3.128e-03  3.091e-03   1.012 0.311605    
## EmbarkedC   -1.398e+01  2.400e+03  -0.006 0.995353    
## EmbarkedQ   -1.387e+01  2.400e+03  -0.006 0.995386    
## EmbarkedS   -1.431e+01  2.400e+03  -0.006 0.995243    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 829.60  on 622  degrees of freedom
## Residual deviance: 530.63  on 602  degrees of freedom
## AIC: 572.63
## 
## Number of Fisher Scoring iterations: 15
fitted.probabilities <- predict(final.log.model,final.test,type='response')

fitted.results <- ifelse(fitted.probabilities>0.5,1,0)
misClassError <- mean(fitted.results!=final.test$Survived)
print(1-misClassError)
## [1] 0.8103255
# confusion matrix
table(final.test$Survived,fitted.probabilities>0.5)
##    
##     FALSE TRUE
##   0   475   74
##   1    95  247

Lecture 98: Logistic Regression Project

In this project, we will be working with the UCI adult dataset. We will be attempting to predict if people in the data set belong in a certain class by salary, either making <=50k or >50k per year.

Typically, most of your time is spent cleaning data, not running the few lines of code that build your model; this project will try to reflect that by presenting different issues that may arise when cleaning data.

Get the data

We first read in the adult.csv file and set it to a data frame called adult.

adult <- read.csv('adult_sal.csv')
head(adult)
##   X age    type_employer fnlwgt education education_num            marital
## 1 1  39        State-gov  77516 Bachelors            13      Never-married
## 2 2  50 Self-emp-not-inc  83311 Bachelors            13 Married-civ-spouse
## 3 3  38          Private 215646   HS-grad             9           Divorced
## 4 4  53          Private 234721      11th             7 Married-civ-spouse
## 5 5  28          Private 338409 Bachelors            13 Married-civ-spouse
## 6 6  37          Private 284582   Masters            14 Married-civ-spouse
##          occupation  relationship  race    sex capital_gain capital_loss
## 1      Adm-clerical Not-in-family White   Male         2174            0
## 2   Exec-managerial       Husband White   Male            0            0
## 3 Handlers-cleaners Not-in-family White   Male            0            0
## 4 Handlers-cleaners       Husband Black   Male            0            0
## 5    Prof-specialty          Wife Black Female            0            0
## 6   Exec-managerial          Wife White Female            0            0
##   hr_per_week       country income
## 1          40 United-States  <=50K
## 2          13 United-States  <=50K
## 3          40 United-States  <=50K
## 4          40 United-States  <=50K
## 5          40          Cuba  <=50K
## 6          40 United-States  <=50K
colnames(adult)
##  [1] "X"             "age"           "type_employer" "fnlwgt"       
##  [5] "education"     "education_num" "marital"       "occupation"   
##  [9] "relationship"  "race"          "sex"           "capital_gain" 
## [13] "capital_loss"  "hr_per_week"   "country"       "income"

The index has been repeated in the X column, so we clean the data set by dropping this column, and check the data overview.

library(dplyr)
adult <- select(adult,-X)

# data overview
head(adult)
##   age    type_employer fnlwgt education education_num            marital
## 1  39        State-gov  77516 Bachelors            13      Never-married
## 2  50 Self-emp-not-inc  83311 Bachelors            13 Married-civ-spouse
## 3  38          Private 215646   HS-grad             9           Divorced
## 4  53          Private 234721      11th             7 Married-civ-spouse
## 5  28          Private 338409 Bachelors            13 Married-civ-spouse
## 6  37          Private 284582   Masters            14 Married-civ-spouse
##          occupation  relationship  race    sex capital_gain capital_loss
## 1      Adm-clerical Not-in-family White   Male         2174            0
## 2   Exec-managerial       Husband White   Male            0            0
## 3 Handlers-cleaners Not-in-family White   Male            0            0
## 4 Handlers-cleaners       Husband Black   Male            0            0
## 5    Prof-specialty          Wife Black Female            0            0
## 6   Exec-managerial          Wife White Female            0            0
##   hr_per_week       country income
## 1          40 United-States  <=50K
## 2          13 United-States  <=50K
## 3          40 United-States  <=50K
## 4          40 United-States  <=50K
## 5          40          Cuba  <=50K
## 6          40 United-States  <=50K
str(adult)
## 'data.frame':    32561 obs. of  15 variables:
##  $ age          : int  39 50 38 53 28 37 49 52 31 42 ...
##  $ type_employer: Factor w/ 9 levels "?","Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
##  $ fnlwgt       : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
##  $ education    : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
##  $ education_num: int  13 13 9 7 13 14 5 9 14 13 ...
##  $ marital      : Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
##  $ occupation   : Factor w/ 15 levels "?","Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
##  $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
##  $ race         : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
##  $ sex          : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
##  $ capital_gain : int  2174 0 0 0 0 0 0 0 14084 5178 ...
##  $ capital_loss : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_per_week  : int  40 13 40 40 40 40 16 45 50 40 ...
##  $ country      : Factor w/ 42 levels "?","Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
##  $ income       : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
summary(adult)
##       age                 type_employer       fnlwgt       
##  Min.   :17.00   Private         :22696   Min.   :  12285  
##  1st Qu.:28.00   Self-emp-not-inc: 2541   1st Qu.: 117827  
##  Median :37.00   Local-gov       : 2093   Median : 178356  
##  Mean   :38.58   ?               : 1836   Mean   : 189778  
##  3rd Qu.:48.00   State-gov       : 1298   3rd Qu.: 237051  
##  Max.   :90.00   Self-emp-inc    : 1116   Max.   :1484705  
##                  (Other)         :  981                    
##         education     education_num                    marital     
##  HS-grad     :10501   Min.   : 1.00   Divorced             : 4443  
##  Some-college: 7291   1st Qu.: 9.00   Married-AF-spouse    :   23  
##  Bachelors   : 5355   Median :10.00   Married-civ-spouse   :14976  
##  Masters     : 1723   Mean   :10.08   Married-spouse-absent:  418  
##  Assoc-voc   : 1382   3rd Qu.:12.00   Never-married        :10683  
##  11th        : 1175   Max.   :16.00   Separated            : 1025  
##  (Other)     : 5134                   Widowed              :  993  
##            occupation           relationship                   race      
##  Prof-specialty :4140   Husband       :13193   Amer-Indian-Eskimo:  311  
##  Craft-repair   :4099   Not-in-family : 8305   Asian-Pac-Islander: 1039  
##  Exec-managerial:4066   Other-relative:  981   Black             : 3124  
##  Adm-clerical   :3770   Own-child     : 5068   Other             :  271  
##  Sales          :3650   Unmarried     : 3446   White             :27816  
##  Other-service  :3295   Wife          : 1568                             
##  (Other)        :9541                                                    
##      sex         capital_gain    capital_loss     hr_per_week   
##  Female:10771   Min.   :    0   Min.   :   0.0   Min.   : 1.00  
##  Male  :21790   1st Qu.:    0   1st Qu.:   0.0   1st Qu.:40.00  
##                 Median :    0   Median :   0.0   Median :40.00  
##                 Mean   : 1078   Mean   :  87.3   Mean   :40.44  
##                 3rd Qu.:    0   3rd Qu.:   0.0   3rd Qu.:45.00  
##                 Max.   :99999   Max.   :4356.0   Max.   :99.00  
##                                                                 
##           country        income     
##  United-States:29170   <=50K:24720  
##  Mexico       :  643   >50K : 7841  
##  ?            :  583                
##  Philippines  :  198                
##  Germany      :  137                
##  Canada       :  121                
##  (Other)      : 1709

Data Cleaning

Notice that we have a lot of columns that are categorical factors, however, a lot of these columns have too many factors than may be necessary. In this data cleaning section, we’ll try to clean these columns up by reducing the number of factors.

Type_employer column

First, we use table() to check the frequency of the type_employer column. - Number of null (‘?’) values in type_employer: 1836 - Two smallest groups: Never-worked, Without-pay

table(adult$type_employer)
## 
##                ?      Federal-gov        Local-gov     Never-worked 
##             1836              960             2093                7 
##          Private     Self-emp-inc Self-emp-not-inc        State-gov 
##            22696             1116             2541             1298 
##      Without-pay 
##               14

We combine these two smallest groups into a single group called “Unemployed”. There are lots of ways to do this, so feel free to get creative. Hint: it may be helpful to convert these objects to character data types (as.character()) and then use sapply() with a custom function.

unemp <- function(job){
  job <- as.character(job)
  if (job=='Never-worked' | job=='Without-pay'){
    return('Unemployed')
  } else {
    return(job)
  }
}
adult$type_employer <- sapply(adult$type_employer,unemp)
table(adult$type_employer)
## 
##                ?      Federal-gov        Local-gov          Private 
##             1836              960             2093            22696 
##     Self-emp-inc Self-emp-not-inc        State-gov       Unemployed 
##             1116             2541             1298               21

What other columns are suitable for combining? We combine state and local government jobs into a category called SL-gov, and combine self-employed jobs into a category called self-emp.

group_emp <- function(job){
  if(job=='Local-gov' | job=='State-gov'){
    return('SL-gov')
  } else if (job=='Self-emp-inc' | job=='Self-emp-not-inc'){
    return('self-emp')
  }else{
    return(job)
  }
}

adult$type_employer <- sapply(adult$type_employer,group_emp)
table(adult$type_employer)
## 
##           ? Federal-gov     Private    self-emp      SL-gov  Unemployed 
##        1836         960       22696        3657        3391          21

Marital Column

Next, we use table() to look at the marital column.

table(adult$marital)
## 
##              Divorced     Married-AF-spouse    Married-civ-spouse 
##                  4443                    23                 14976 
## Married-spouse-absent         Never-married             Separated 
##                   418                 10683                  1025 
##               Widowed 
##                   993

Let’s reduce this to three groups:

group_marital <- function(marital){
  marital <- as.character(marital)
  # not married
  if(marital=='Separated' | marital=='Divorced' | marital=='Widowed'){
    return('Non-Married')
  # never married
  } else if (marital=='Never-married'){
    return('Never-Married')
  # married
  } else{
    return('Married')
  }
}

adult$marital <- sapply(adult$marital,group_marital)
table(adult$marital)
## 
##       Married Never-Married   Non-Married 
##         15417         10683          6461
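As an aside, the same grouping could be expressed without a custom function using dplyr’s case_when() (shown for reference only; it is an alternative to the sapply() call above, not an additional step):

```r
library(dplyr)

adult.alt <- adult %>%
  mutate(marital = case_when(
    marital %in% c('Separated', 'Divorced', 'Widowed') ~ 'Non-Married',
    marital == 'Never-married'                         ~ 'Never-Married',
    TRUE                                               ~ 'Married'   # everything else
  ))
```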

Country Column

We check the country column using table()

table(adult$country)
## 
##                          ?                   Cambodia 
##                        583                         19 
##                     Canada                      China 
##                        121                         75 
##                   Columbia                       Cuba 
##                         59                         95 
##         Dominican-Republic                    Ecuador 
##                         70                         28 
##                El-Salvador                    England 
##                        106                         90 
##                     France                    Germany 
##                         29                        137 
##                     Greece                  Guatemala 
##                         29                         64 
##                      Haiti         Holand-Netherlands 
##                         44                          1 
##                   Honduras                       Hong 
##                         13                         20 
##                    Hungary                      India 
##                         13                        100 
##                       Iran                    Ireland 
##                         43                         24 
##                      Italy                    Jamaica 
##                         73                         81 
##                      Japan                       Laos 
##                         62                         18 
##                     Mexico                  Nicaragua 
##                        643                         34 
## Outlying-US(Guam-USVI-etc)                       Peru 
##                         14                         31 
##                Philippines                     Poland 
##                        198                         60 
##                   Portugal                Puerto-Rico 
##                         37                        114 
##                   Scotland                      South 
##                         12                         80 
##                     Taiwan                   Thailand 
##                         51                         18 
##            Trinadad&Tobago              United-States 
##                         19                      29170 
##                    Vietnam                 Yugoslavia 
##                         67                         16

We group these countries together however we see fit. There is flexibility here because there is no right or wrong way to do this; one possibility is to group by continent. Either way, we should be able to reduce the number of groups significantly.

levels(adult$country)
##  [1] "?"                          "Cambodia"                  
##  [3] "Canada"                     "China"                     
##  [5] "Columbia"                   "Cuba"                      
##  [7] "Dominican-Republic"         "Ecuador"                   
##  [9] "El-Salvador"                "England"                   
## [11] "France"                     "Germany"                   
## [13] "Greece"                     "Guatemala"                 
## [15] "Haiti"                      "Holand-Netherlands"        
## [17] "Honduras"                   "Hong"                      
## [19] "Hungary"                    "India"                     
## [21] "Iran"                       "Ireland"                   
## [23] "Italy"                      "Jamaica"                   
## [25] "Japan"                      "Laos"                      
## [27] "Mexico"                     "Nicaragua"                 
## [29] "Outlying-US(Guam-USVI-etc)" "Peru"                      
## [31] "Philippines"                "Poland"                    
## [33] "Portugal"                   "Puerto-Rico"               
## [35] "Scotland"                   "South"                     
## [37] "Taiwan"                     "Thailand"                  
## [39] "Trinadad&Tobago"            "United-States"             
## [41] "Vietnam"                    "Yugoslavia"
Asia <- c('China','Hong','India','Iran','Cambodia','Japan', 'Laos' ,
          'Philippines' ,'Vietnam' ,'Taiwan', 'Thailand')

North.America <- c('Canada','United-States','Puerto-Rico' )

Europe <- c('England' ,'France', 'Germany' ,'Greece','Holand-Netherlands','Hungary',
            'Ireland','Italy','Poland','Portugal','Scotland','Yugoslavia')

Latin.and.South.America <- c('Columbia','Cuba','Dominican-Republic','Ecuador',
                             'El-Salvador','Guatemala','Haiti','Honduras',
                             'Mexico','Nicaragua','Outlying-US(Guam-USVI-etc)','Peru',
                            'Jamaica','Trinadad&Tobago')
Other <- c('South')
library(magrittr)
group_country <- function(ctry){
    if (ctry %in% Asia){
        return('Asia')
    }else if (ctry %in% North.America){
        return('North.America')
    }else if (ctry %in% Europe){
        return('Europe')
    }else if (ctry %in% Latin.and.South.America){
        return('Latin.and.South.America')
    }else{
        return('Other')      
    }
}
adult$country <- sapply(adult$country,group_country)
table(adult$country)
## 
##                    Asia                  Europe Latin.and.South.America 
##                     671                     521                    1301 
##           North.America                   Other 
##                   29405                     663

Once the columns are cleaned, we check str() of adult again, and make sure any of the columns we changed are converted back to factors with factor().

adult$type_employer <- as.factor(adult$type_employer)
adult$country <- sapply(adult$country,factor)
adult$marital <- sapply(adult$marital,factor)

str(adult)
## 'data.frame':    32561 obs. of  15 variables:
##  $ age          : int  39 50 38 53 28 37 49 52 31 42 ...
##  $ type_employer: Factor w/ 7 levels "?","Federal-gov",..: 6 5 4 4 4 4 4 5 4 4 ...
##  $ fnlwgt       : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
##  $ education    : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
##  $ education_num: int  13 13 9 7 13 14 5 9 14 13 ...
##  $ marital      : Factor w/ 3 levels "Never-Married",..: 1 2 3 2 2 2 2 2 1 2 ...
##  $ occupation   : Factor w/ 15 levels "?","Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
##  $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
##  $ race         : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
##  $ sex          : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
##  $ capital_gain : int  2174 0 0 0 0 0 0 0 14084 5178 ...
##  $ capital_loss : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_per_week  : int  40 13 40 40 40 40 16 45 50 40 ...
##  $ country      : Factor w/ 5 levels "North.America",..: 1 1 1 1 2 1 2 1 1 1 ...
##  $ income       : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...

Missing data

Notice how we have data that is missing.

We first convert any cells with a '?' value to NA. Hint: is.na() may be useful here, or we can use brackets with a conditional statement.

library(Amelia)
## Warning: package 'Amelia' was built under R version 3.4.4
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2018 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
adult[adult=='?'] <- NA

Using table() on a column with NA values, those NAs are not shown by default; the old '?' level simply appears with a count of 0. We then re-factor the affected columns so the unused '?' level is dropped.

table(adult$type_employer)
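By default, table() drops NA entries. If we want to see the NA counts explicitly, table() accepts a useNA argument; a minimal self-contained sketch:

# table() hides NAs by default; useNA = 'ifany' counts them
x <- factor(c('a','b',NA,'a'))
table(x)                    # NA not shown
table(x, useNA = 'ifany')   # NA counted as its own category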
adult$type_employer <- sapply(adult$type_employer,factor)
adult$country <- sapply(adult$country,factor)
adult$marital <- sapply(adult$marital,factor)
adult$occupation <- sapply(adult$occupation,factor)

We could have also done something like: adult$type_employer <- factor(adult$type_employer)

We check the distribution of missing values. Using missmap() from the Amelia package, we get a heatmap that points out missing (NA) values. This gives us a quick glance at how much data is missing; in this case, not a whole lot (relatively speaking). We also notice a bunch of y-axis labels; we get rid of them by running the second command below. What is col=c('yellow','black') doing?

library(Amelia)
missmap(adult)

missmap(adult,y.at=c(1),y.labels = c(''),col=c('yellow','black'))

Next, we use na.omit() to omit NA data from the adult data frame. Note, it really depends on the situation and the data to judge whether or not this is a good decision. We shouldn’t always drop NA values.

adult <- na.omit(adult)
str(adult)
## 'data.frame':    30718 obs. of  15 variables:
##  $ age          : int  39 50 38 53 28 37 49 52 31 42 ...
##  $ type_employer: Factor w/ 6 levels "SL-gov","self-emp",..: 1 2 3 3 3 3 3 2 3 3 ...
##  $ fnlwgt       : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
##  $ education    : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
##  $ education_num: int  13 13 9 7 13 14 5 9 14 13 ...
##  $ marital      : Factor w/ 3 levels "Never-Married",..: 1 2 3 2 2 2 2 2 1 2 ...
##  $ occupation   : Factor w/ 14 levels "Adm-clerical",..: 1 2 3 3 4 2 5 2 4 2 ...
##  $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
##  $ race         : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
##  $ sex          : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
##  $ capital_gain : int  2174 0 0 0 0 0 0 0 14084 5178 ...
##  $ capital_loss : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_per_week  : int  40 13 40 40 40 40 16 45 50 40 ...
##  $ country      : Factor w/ 5 levels "North.America",..: 1 1 1 1 2 1 2 1 1 1 ...
##  $ income       : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:1843] 28 62 70 78 107 129 150 155 161 188 ...
##   .. ..- attr(*, "names")= chr [1:1843] "28" "62" "70" "78" ...

We again use missmap() to check that all the NA values were in fact dropped.

missmap(adult,y.at=c(1),y.labels=c(""),col=c("yellow","black"))

EDA

Although we’ve cleaned the data, we still haven’t explored it with visualization. We use ggplot2 to create a histogram of ages, colored by income.

str(adult)
## 'data.frame':    30718 obs. of  15 variables:
##  $ age          : int  39 50 38 53 28 37 49 52 31 42 ...
##  $ type_employer: Factor w/ 6 levels "SL-gov","self-emp",..: 1 2 3 3 3 3 3 2 3 3 ...
##  $ fnlwgt       : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
##  $ education    : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
##  $ education_num: int  13 13 9 7 13 14 5 9 14 13 ...
##  $ marital      : Factor w/ 3 levels "Never-Married",..: 1 2 3 2 2 2 2 2 1 2 ...
##  $ occupation   : Factor w/ 14 levels "Adm-clerical",..: 1 2 3 3 4 2 5 2 4 2 ...
##  $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
##  $ race         : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
##  $ sex          : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
##  $ capital_gain : int  2174 0 0 0 0 0 0 0 14084 5178 ...
##  $ capital_loss : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_per_week  : int  40 13 40 40 40 40 16 45 50 40 ...
##  $ country      : Factor w/ 5 levels "North.America",..: 1 1 1 1 2 1 2 1 1 1 ...
##  $ income       : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:1843] 28 62 70 78 107 129 150 155 161 188 ...
##   .. ..- attr(*, "names")= chr [1:1843] "28" "62" "70" "78" ...
library(ggplot2)
library(dplyr)

ggplot(adult,aes(x=age))+geom_histogram(aes(fill=income),color='black',binwidth=1)+theme_bw()

We plot a histogram of hours worked per week.

ggplot(adult,aes(x=hr_per_week))+geom_histogram(fill="light blue")+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We rename the country column to better reflect factor levels.

# lots of ways to do this, could use dplyr as well

names(adult)[names(adult)=="country"] <- "region"
str(adult)
## 'data.frame':    30718 obs. of  15 variables:
##  $ age          : int  39 50 38 53 28 37 49 52 31 42 ...
##  $ type_employer: Factor w/ 6 levels "SL-gov","self-emp",..: 1 2 3 3 3 3 3 2 3 3 ...
##  $ fnlwgt       : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
##  $ education    : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
##  $ education_num: int  13 13 9 7 13 14 5 9 14 13 ...
##  $ marital      : Factor w/ 3 levels "Never-Married",..: 1 2 3 2 2 2 2 2 1 2 ...
##  $ occupation   : Factor w/ 14 levels "Adm-clerical",..: 1 2 3 3 4 2 5 2 4 2 ...
##  $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
##  $ race         : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
##  $ sex          : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
##  $ capital_gain : int  2174 0 0 0 0 0 0 0 14084 5178 ...
##  $ capital_loss : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_per_week  : int  40 13 40 40 40 40 16 45 50 40 ...
##  $ region       : Factor w/ 5 levels "North.America",..: 1 1 1 1 2 1 2 1 1 1 ...
##  $ income       : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:1843] 28 62 70 78 107 129 150 155 161 188 ...
##   .. ..- attr(*, "names")= chr [1:1843] "28" "62" "70" "78" ...
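As the comment above notes, dplyr offers an equivalent; this is a sketch of what we could have run instead of the base-R assignment (it assumes the column is still named country):

library(dplyr)
# rename(data, new_name = old_name) returns the data frame with the column renamed
adult <- rename(adult, region = country)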

We create a barplot of region with the fill color defined by income class. Optional: figure out how to rotate the x-axis text for readability.

ggplot(adult,aes(x=region))+geom_bar(aes(fill=income),color='black')+theme_bw()+theme(axis.text.x=element_text(angle=90,hjust=1))

Building a Model

Now it’s time to build a model to classify people into two groups: above or below 50k in salary.

Logistic Regression

Refer to the ISLR lectures for details if we are fuzzy on any of this.

Logistic regression is a type of classification model. In classification models, we attempt to predict the outcome of categorical dependent variables, using one or more independent variables. The independent variables can be categorical or numerical.

Logistic regression is based on the logistic function, which always takes values between 0 and 1. Replacing the argument of the logistic function with a linear combination of the independent variables we intend to use for regression, we arrive at the formula for logistic regression.
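As a quick illustration, the logistic (sigmoid) function can be written directly in R:

# the logistic function maps any real value into (0, 1)
logistic <- function(z) 1 / (1 + exp(-z))
logistic(0)   # 0.5, the decision boundary
logistic(10)  # close to 1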

We take a quick look at the head() of adult to make sure we have a good overview before building the model.

head(adult)
##   age type_employer fnlwgt education education_num       marital
## 1  39        SL-gov  77516 Bachelors            13 Never-Married
## 2  50      self-emp  83311 Bachelors            13       Married
## 3  38       Private 215646   HS-grad             9   Non-Married
## 4  53       Private 234721      11th             7       Married
## 5  28       Private 338409 Bachelors            13       Married
## 6  37       Private 284582   Masters            14       Married
##          occupation  relationship  race    sex capital_gain capital_loss
## 1      Adm-clerical Not-in-family White   Male         2174            0
## 2   Exec-managerial       Husband White   Male            0            0
## 3 Handlers-cleaners Not-in-family White   Male            0            0
## 4 Handlers-cleaners       Husband Black   Male            0            0
## 5    Prof-specialty          Wife Black Female            0            0
## 6   Exec-managerial          Wife White Female            0            0
##   hr_per_week                  region income
## 1          40           North.America  <=50K
## 2          13           North.America  <=50K
## 3          40           North.America  <=50K
## 4          40           North.America  <=50K
## 5          40 Latin.and.South.America  <=50K
## 6          40           North.America  <=50K

Train test split

As a first step, we split the data into a train and test set using the caTools library as done in the previous lectures. Reference previous solution notebooks if we need a refresher.

# caTools library
library(caTools)

# set a random seed 
set.seed(101)

# split up the sample; this randomly assigns TRUE/FALSE to each row, preserving the income ratio
sample <- sample.split(adult$income,SplitRatio = 0.7)

# Training data
train = subset(adult,sample==TRUE)
# Testing data
test=subset(adult,sample==FALSE)

head(train)
##   age type_employer fnlwgt education education_num       marital
## 1  39        SL-gov  77516 Bachelors            13 Never-Married
## 2  50      self-emp  83311 Bachelors            13       Married
## 4  53       Private 234721      11th             7       Married
## 5  28       Private 338409 Bachelors            13       Married
## 6  37       Private 284582   Masters            14       Married
## 7  49       Private 160187       9th             5       Married
##          occupation  relationship  race    sex capital_gain capital_loss
## 1      Adm-clerical Not-in-family White   Male         2174            0
## 2   Exec-managerial       Husband White   Male            0            0
## 4 Handlers-cleaners       Husband Black   Male            0            0
## 5    Prof-specialty          Wife Black Female            0            0
## 6   Exec-managerial          Wife White Female            0            0
## 7     Other-service Not-in-family Black Female            0            0
##   hr_per_week                  region income
## 1          40           North.America  <=50K
## 2          13           North.America  <=50K
## 4          40           North.America  <=50K
## 5          40 Latin.and.South.America  <=50K
## 6          40           North.America  <=50K
## 7          16 Latin.and.South.America  <=50K
head(test)
##    age type_employer fnlwgt education education_num       marital
## 3   38       Private 215646   HS-grad             9   Non-Married
## 15  40       Private 121772 Assoc-voc            11       Married
## 17  25      self-emp 176756   HS-grad             9 Never-Married
## 18  32       Private 186824   HS-grad             9 Never-Married
## 19  38       Private  28887      11th             7       Married
## 22  54       Private 302146   HS-grad             9   Non-Married
##           occupation  relationship               race    sex capital_gain
## 3  Handlers-cleaners Not-in-family              White   Male            0
## 15      Craft-repair       Husband Asian-Pac-Islander   Male            0
## 17   Farming-fishing     Own-child              White   Male            0
## 18 Machine-op-inspct     Unmarried              White   Male            0
## 19             Sales       Husband              White   Male            0
## 22     Other-service     Unmarried              Black Female            0
##    capital_loss hr_per_week        region income
## 3             0          40 North.America  <=50K
## 15            0          40         Other   >50K
## 17            0          35 North.America  <=50K
## 18            0          40 North.America  <=50K
## 19            0          50 North.America  <=50K
## 22            0          20 North.America  <=50K

Fitting generalized linear models

Description

glm is used to fit generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution.

Usage

glm(formula, family = gaussian, data, weights, subset, ...)

Details

A typical predictor has the form response ~ terms where response is the (numeric) response vector and terms is a series of terms which specifies a linear predictor for response. For binomial and quasibinomial families the response can also be specified as a factor (when the first level denotes failure and all others success) or as a two-column matrix with the columns giving the numbers of successes and failures. A terms specification of the form first + second indicates all the terms in first together with all the terms in second with any duplicates removed.

A specification of the form first:second indicates the set of terms obtained by taking the interactions of all terms in first with all terms in second. The specification first*second indicates the cross of first and second. This is the same as first + second + first:second.
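We can confirm the expansion of first*second directly, using hypothetical terms a and b:

# a*b is shorthand for a + b + a:b
attr(terms(y ~ a * b), 'term.labels')
## [1] "a"   "b"   "a:b"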

The terms in the formula will be re-ordered so that main effects come first, followed by the interactions, all second-order, all third-order and so on: to avoid this pass a terms object as the formula.

Non-NULL weights can be used to indicate that different observations have different dispersions (with the values in weights being inversely proportional to the dispersions); or equivalently, when the elements of weights are positive integers w_i, that each response y_i is the mean of w_i unit-weight observations. For a binomial GLM prior weights are used to give the number of trials when the response is the proportion of successes: they would rarely be used for a Poisson GLM.

glm.fit is the workhorse function: it is not normally called directly but can be more efficient where the response vector, design matrix and family have already been calculated.

If more than one of etastart, start and mustart is specified, the first in the list will be used. It is often advisable to supply starting values for a quasi family, and also for families with unusual links such as gaussian(“log”).

All of weights, subset, offset, etastart and mustart are evaluated in the same way as variables in formula, that is first in data and then in the environment of formula.

For the background to warning messages about ‘fitted probabilities numerically 0 or 1 occurred’ for binomial GLMs, see Venables & Ripley (2002, pp. 197–8).

Values

glm returns an object of class inheriting from “glm” which inherits from the class “lm”. See later in this section. If a non-standard method is used, the object will also inherit from the class (if any) returned by that function.

The function summary (i.e., summary.glm) can be used to obtain or print a summary of the results and the function anova (i.e., anova.glm) to produce an analysis of variance table.

The generic accessor functions coefficients, effects, fitted.values and residuals can be used to extract various useful features of the value returned by glm.
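A minimal sketch of these accessors on a small binomial glm, fit to the built-in mtcars data rather than the adult set:

# tiny logistic model: transmission type from horsepower and weight
fit <- glm(am ~ hp + wt, family = binomial(logit), data = mtcars)
coef(fit)            # coefficients
head(fitted(fit))    # fitted probabilities
head(residuals(fit)) # deviance residuals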

weights extracts a vector of weights, one for each case in the fit (after subsetting and na.action).

An object of class “glm” is a list containing at least the following components: - coefficients - residuals - fitted.values - rank - family - linear.predictors - deviance - aic - null.deviance…

In addition, non-empty fits will have components qr, R and effects relating to the final weighted linear fit.

Objects of class “glm” are normally of class c(“glm”,“lm”), that is inherit from class “lm”, and well-designed methods for class “lm” will be applied to the weighted linear model at the final iteration of IWLS. However, care is needed, as extractor functions for class “glm” such as residuals and weights do not just pick out the component of the fit with the same name.

If a binomial glm model was specified by giving a two-column response, the weights returned by prior.weights are the total numbers of cases (factored by the supplied case weights) and the component y of the result is the proportion of successes.

Fitting functions

The argument method serves two purposes. One is to allow the model frame to be recreated with no fitting. The other is to allow the default fitting function glm.fit to be replaced by a function which takes the same arguments and uses a different fitting algorithm. If glm.fit is supplied as a character string it is used to search for a function of that name, starting in the stats namespace.

The class of the object returned by the fitter (if any) will be prepended to the class returned by glm.

Application of glm model

We use all the features to train a glm() model on the training data set, pass the argument family=binomial(logit) into the glm function.

adult.model = glm(income ~., family=binomial(logit),data=train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(adult.model)
## 
## Call:
## glm(formula = income ~ ., family = binomial(logit), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.1169  -0.5168  -0.1966   0.0000   3.6269  
## 
## Coefficients: (1 not defined because of singularities)
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -7.346e+00  4.307e-01 -17.056  < 2e-16 ***
## age                            2.536e-02  2.008e-03  12.629  < 2e-16 ***
## type_employerself-emp         -1.266e-02  1.208e-01  -0.105 0.916508    
## type_employerPrivate           2.170e-01  1.088e-01   1.994 0.046190 *  
## type_employerFederal-gov       6.634e-01  1.500e-01   4.424 9.70e-06 ***
## type_employerLocal-gov        -3.216e-02  1.286e-01  -0.250 0.802480    
## type_employerUnemployed       -1.348e+01  3.688e+02  -0.037 0.970843    
## fnlwgt                         5.427e-07  2.085e-07   2.603 0.009247 ** 
## education11th                  2.094e-01  2.570e-01   0.815 0.415332    
## education12th                  3.928e-01  3.410e-01   1.152 0.249272    
## education1st-4th              -4.591e-01  6.067e-01  -0.757 0.449238    
## education5th-6th              -8.027e-02  3.980e-01  -0.202 0.840166    
## education7th-8th              -4.985e-01  2.880e-01  -1.731 0.083453 .  
## education9th                  -1.210e-02  3.191e-01  -0.038 0.969755    
## educationAssoc-acdm            1.251e+00  2.165e-01   5.778 7.56e-09 ***
## educationAssoc-voc             1.452e+00  2.084e-01   6.969 3.18e-12 ***
## educationBachelors             2.003e+00  1.938e-01  10.338  < 2e-16 ***
## educationDoctorate             2.869e+00  2.641e-01  10.863  < 2e-16 ***
## educationHS-grad               8.360e-01  1.889e-01   4.427 9.56e-06 ***
## educationMasters               2.348e+00  2.064e-01  11.376  < 2e-16 ***
## educationPreschool            -1.879e+01  1.645e+02  -0.114 0.909043    
## educationProf-school           2.797e+00  2.468e-01  11.336  < 2e-16 ***
## educationSome-college          1.203e+00  1.915e-01   6.283 3.31e-10 ***
## education_num                         NA         NA      NA       NA    
## maritalMarried                 1.280e+00  1.943e-01   6.589 4.43e-11 ***
## maritalNon-Married             5.435e-01  9.953e-02   5.461 4.74e-08 ***
## occupationExec-managerial      7.689e-01  9.095e-02   8.454  < 2e-16 ***
## occupationHandlers-cleaners   -7.934e-01  1.726e-01  -4.596 4.30e-06 ***
## occupationProf-specialty       4.964e-01  9.630e-02   5.155 2.54e-07 ***
## occupationOther-service       -8.240e-01  1.386e-01  -5.945 2.77e-09 ***
## occupationSales                2.899e-01  9.750e-02   2.974 0.002943 ** 
## occupationCraft-repair         4.214e-02  9.486e-02   0.444 0.656895    
## occupationTransport-moving    -1.106e-01  1.190e-01  -0.929 0.352705    
## occupationFarming-fishing     -1.120e+00  1.619e-01  -6.918 4.58e-12 ***
## occupationMachine-op-inspct   -2.190e-01  1.203e-01  -1.821 0.068616 .  
## occupationTech-support         6.827e-01  1.325e-01   5.152 2.58e-07 ***
## occupationProtective-serv      6.056e-01  1.494e-01   4.053 5.07e-05 ***
## occupationArmed-Forces        -6.250e-01  1.844e+00  -0.339 0.734654    
## occupationPriv-house-serv     -3.601e+00  1.938e+00  -1.858 0.063212 .  
## relationshipNot-in-family     -8.662e-01  1.907e-01  -4.541 5.59e-06 ***
## relationshipOther-relative    -1.086e+00  2.546e-01  -4.266 1.99e-05 ***
## relationshipOwn-child         -1.797e+00  2.357e-01  -7.624 2.46e-14 ***
## relationshipUnmarried         -1.030e+00  2.154e-01  -4.782 1.74e-06 ***
## relationshipWife               1.475e+00  1.235e-01  11.948  < 2e-16 ***
## raceAsian-Pac-Islander         6.069e-01  3.207e-01   1.893 0.058403 .  
## raceBlack                      4.534e-01  2.848e-01   1.592 0.111327    
## raceOther                      4.236e-02  4.217e-01   0.100 0.920005    
## raceWhite                      6.599e-01  2.712e-01   2.434 0.014948 *  
## sexMale                        8.847e-01  9.382e-02   9.430  < 2e-16 ***
## capital_gain                   3.193e-04  1.273e-05  25.077  < 2e-16 ***
## capital_loss                   6.551e-04  4.562e-05  14.359  < 2e-16 ***
## hr_per_week                    2.907e-02  1.988e-03  14.624  < 2e-16 ***
## regionLatin.and.South.America -5.925e-01  1.595e-01  -3.714 0.000204 ***
## regionAsia                    -6.469e-02  2.044e-01  -0.316 0.751670    
## regionOther                   -4.299e-01  1.651e-01  -2.604 0.009217 ** 
## regionEurope                   4.420e-02  1.552e-01   0.285 0.775868    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 24138  on 21502  degrees of freedom
## Residual deviance: 14004  on 21448  degrees of freedom
## AIC: 14114
## 
## Number of Fisher Scoring iterations: 14

We still have a lot of features, some important, some not so much. R comes with a handy function called step(). The step() function iteratively tries to remove predictor variables from the model, in an attempt to delete variables that do not significantly add to the fit. How does it do this? It uses AIC.

Choose a model by AIC in a stepwise algorithm

The step() function selects a formula-based model by AIC.

Usage: step(object, scope, scale = 0, direction = c("both", "backward", "forward"), trace = 1, keep = NULL, steps = 1000, k = 2, ...)

Details

step uses add1 and drop1 repeatedly; it will work for any method for which they work, and that is determined by having a valid method for extractAIC. When the additive constant can be chosen so that AIC is equal to Mallows’ Cp, this is done and the tables are labelled appropriately.

The set of models searched is determined by the scope argument. The right-hand-side of its lower component is always included in the model, and right-hand-side of the model is included in the upper component. If scope is a single formula, it specifies the upper component, and the lower model is empty. If scope is missing, the initial model is used as the upper model.

Models specified by scope can be templates to update object as used by update.formula. So using . in a scope formula means ‘what is already there’, with .^2 indicating all interactions of existing terms.
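A sketch of how scope bounds the search, using a small lm fit to the built-in mtcars data (not the adult model from this lecture):

fit <- lm(mpg ~ hp + wt, data = mtcars)
# lower = intercept only; upper = current terms plus all pairwise interactions
step(fit, scope = list(lower = ~ 1, upper = ~ .^2), trace = 0)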

There is a potential problem in using glm fits with a variable scale, as in that case the deviance is not simply related to the maximized log-likelihood. The “glm” method for function extractAIC makes the appropriate adjustment for a gaussian family, but may need to be amended for other cases. (The binomial and poisson families have fixed scale by default and do not correspond to a particular maximum-likelihood problem for variable scale.)

Values

the stepwise-selected model is returned, with up to two additional components. There is an “anova” component corresponding to the steps taken in the search, as well as a “keep” component if the keep= argument was supplied in the call. The “Resid. Dev” column of the analysis of deviance table refers to a constant minus twice the maximized log likelihood: it will be a deviance only in cases where a saturated model is well-defined (thus excluding lm, aov and survreg fits, for example).

Warning

The model fitting must apply the models to the same dataset. This may be a problem if there are missing values and R’s default of na.action = na.omit is used. We suggest you remove the missing values first.

Calls to the function nobs are used to check that the number of observations involved in the fitting process remains unchanged.

Note

This function differs considerably from the function in S, which uses a number of approximations and does not in general compute the correct AIC.

This is a minimal implementation. Use stepAIC in package MASS for a wider range of object classes.

Application of AIC to the model

We use new.model <- step(model.name) to use the step() function to create a new model.

new.step.model <- step(adult.model)
## Start:  AIC=14113.99
## income ~ age + type_employer + fnlwgt + education + education_num + 
##     marital + occupation + relationship + race + sex + capital_gain + 
##     capital_loss + hr_per_week + region
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Step:  AIC=14113.99
## income ~ age + type_employer + fnlwgt + education + marital + 
##     occupation + relationship + race + sex + capital_gain + capital_loss + 
##     hr_per_week + region
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                 Df Deviance   AIC
## <none>                14004 14114
## - fnlwgt         1    14011 14119
## - race           4    14019 14121
## - region         4    14026 14128
## - type_employer  5    14050 14150
## - marital        2    14060 14166
## - sex            1    14096 14204
## - age            1    14165 14273
## - capital_loss   1    14217 14325
## - hr_per_week    1    14222 14330
## - relationship   5    14288 14388
## - occupation    13    14443 14527
## - education     15    14718 14798
## - capital_gain   1    15248 15356
summary(new.step.model)
## 
## Call:
## glm(formula = income ~ age + type_employer + fnlwgt + education + 
##     marital + occupation + relationship + race + sex + capital_gain + 
##     capital_loss + hr_per_week + region, family = binomial(logit), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.1169  -0.5168  -0.1966   0.0000   3.6269  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -7.346e+00  4.307e-01 -17.056  < 2e-16 ***
## age                            2.536e-02  2.008e-03  12.629  < 2e-16 ***
## type_employerself-emp         -1.266e-02  1.208e-01  -0.105 0.916508    
## type_employerPrivate           2.170e-01  1.088e-01   1.994 0.046190 *  
## type_employerFederal-gov       6.634e-01  1.500e-01   4.424 9.70e-06 ***
## type_employerLocal-gov        -3.216e-02  1.286e-01  -0.250 0.802480    
## type_employerUnemployed       -1.348e+01  3.688e+02  -0.037 0.970843    
## fnlwgt                         5.427e-07  2.085e-07   2.603 0.009247 ** 
## education11th                  2.094e-01  2.570e-01   0.815 0.415332    
## education12th                  3.928e-01  3.410e-01   1.152 0.249272    
## education1st-4th              -4.591e-01  6.067e-01  -0.757 0.449238    
## education5th-6th              -8.027e-02  3.980e-01  -0.202 0.840166    
## education7th-8th              -4.985e-01  2.880e-01  -1.731 0.083453 .  
## education9th                  -1.210e-02  3.191e-01  -0.038 0.969755    
## educationAssoc-acdm            1.251e+00  2.165e-01   5.778 7.56e-09 ***
## educationAssoc-voc             1.452e+00  2.084e-01   6.969 3.18e-12 ***
## educationBachelors             2.003e+00  1.938e-01  10.338  < 2e-16 ***
## educationDoctorate             2.869e+00  2.641e-01  10.863  < 2e-16 ***
## educationHS-grad               8.360e-01  1.889e-01   4.427 9.56e-06 ***
## educationMasters               2.348e+00  2.064e-01  11.376  < 2e-16 ***
## educationPreschool            -1.879e+01  1.645e+02  -0.114 0.909043    
## educationProf-school           2.797e+00  2.468e-01  11.336  < 2e-16 ***
## educationSome-college          1.203e+00  1.915e-01   6.283 3.31e-10 ***
## maritalMarried                 1.280e+00  1.943e-01   6.589 4.43e-11 ***
## maritalNon-Married             5.435e-01  9.953e-02   5.461 4.74e-08 ***
## occupationExec-managerial      7.689e-01  9.095e-02   8.454  < 2e-16 ***
## occupationHandlers-cleaners   -7.934e-01  1.726e-01  -4.596 4.30e-06 ***
## occupationProf-specialty       4.964e-01  9.630e-02   5.155 2.54e-07 ***
## occupationOther-service       -8.240e-01  1.386e-01  -5.945 2.77e-09 ***
## occupationSales                2.899e-01  9.750e-02   2.974 0.002943 ** 
## occupationCraft-repair         4.214e-02  9.486e-02   0.444 0.656895    
## occupationTransport-moving    -1.106e-01  1.190e-01  -0.929 0.352705    
## occupationFarming-fishing     -1.120e+00  1.619e-01  -6.918 4.58e-12 ***
## occupationMachine-op-inspct   -2.190e-01  1.203e-01  -1.821 0.068616 .  
## occupationTech-support         6.827e-01  1.325e-01   5.152 2.58e-07 ***
## occupationProtective-serv      6.056e-01  1.494e-01   4.053 5.07e-05 ***
## occupationArmed-Forces        -6.250e-01  1.844e+00  -0.339 0.734654    
## occupationPriv-house-serv     -3.601e+00  1.938e+00  -1.858 0.063212 .  
## relationshipNot-in-family     -8.662e-01  1.907e-01  -4.541 5.59e-06 ***
## relationshipOther-relative    -1.086e+00  2.546e-01  -4.266 1.99e-05 ***
## relationshipOwn-child         -1.797e+00  2.357e-01  -7.624 2.46e-14 ***
## relationshipUnmarried         -1.030e+00  2.154e-01  -4.782 1.74e-06 ***
## relationshipWife               1.475e+00  1.235e-01  11.948  < 2e-16 ***
## raceAsian-Pac-Islander         6.069e-01  3.207e-01   1.893 0.058403 .  
## raceBlack                      4.534e-01  2.848e-01   1.592 0.111327    
## raceOther                      4.236e-02  4.217e-01   0.100 0.920005    
## raceWhite                      6.599e-01  2.712e-01   2.434 0.014948 *  
## sexMale                        8.847e-01  9.382e-02   9.430  < 2e-16 ***
## capital_gain                   3.193e-04  1.273e-05  25.077  < 2e-16 ***
## capital_loss                   6.551e-04  4.562e-05  14.359  < 2e-16 ***
## hr_per_week                    2.907e-02  1.988e-03  14.624  < 2e-16 ***
## regionLatin.and.South.America -5.925e-01  1.595e-01  -3.714 0.000204 ***
## regionAsia                    -6.469e-02  2.044e-01  -0.316 0.751670    
## regionOther                   -4.299e-01  1.651e-01  -2.604 0.009217 ** 
## regionEurope                   4.420e-02  1.552e-01   0.285 0.775868    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 24138  on 21502  degrees of freedom
## Residual deviance: 14004  on 21448  degrees of freedom
## AIC: 14114
## 
## Number of Fisher Scoring iterations: 14

We notice that the step() function kept all the features used previously. While we used the AIC criterion to compare models, there are other criteria we could have used. If you want, you can read about the variance inflation factor (VIF) and the vif() function to explore other options for comparison criteria. In the meantime, let’s continue on and see how well our model performs against the test set.

We create a confusion matrix by calling the predict function with type="response" on the test set and thresholding the predicted probabilities at 0.5.

test$predicted.income=predict(adult.model,newdata=test,type='response')
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
table(test$income,test$predicted.income>0.5)
##        
##         FALSE TRUE
##   <=50K  6372  548
##   >50K    873 1422

The accuracy of the model is

(6372+1422) / (6372+1422+548+873)
## [1] 0.8457949

Other measures of performance, recall and precision, are:

# recall
6372/(6372+548)
## [1] 0.9208092
# precision
6372 / (6372+873)
## [1] 0.8795031
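These calculations can also be done by operating on the confusion matrix directly; a small base-R sketch, with the counts taken from the table above:

```r
# Confusion matrix from the table above: rows = actual income, cols = predicted
cm <- matrix(c(6372, 873, 548, 1422), nrow = 2,
             dimnames = list(actual = c("<=50K", ">50K"),
                             predicted = c("<=50K", ">50K")))

accuracy  <- sum(diag(cm)) / sum(cm)   # correct predictions over all predictions
recall    <- cm[1, 1] / sum(cm[1, ])   # of the actual <=50K cases, fraction found
precision <- cm[1, 1] / sum(cm[, 1])   # of the predicted <=50K cases, fraction correct

round(c(accuracy = accuracy, recall = recall, precision = precision), 4)
# accuracy 0.8458, recall 0.9208, precision 0.8795
```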

Lecture 102: K Nearest Neighbors Introduction

Reference: ISLR Chapter 4

K Nearest Neighbors is a classification algorithm that operates on a very simple principle. It is best shown through an example. Imagine we have some made-up data on dogs and horses, with heights and weights.

Training algorithm:

1. Store all the data

Prediction algorithm:

1. Calculate the distance from x to all points in the data
2. Sort the points in the data by increasing distance from x
3. Predict the majority label of the “k” closest points

Choosing a K will affect which class a new point is assigned to.

Pros:

- very simple
- training is trivial
- works with any number of classes
- easy to add more data
- few parameters: K and the distance metric

Cons:

- high prediction costs
- not good with high-dimensional data
- categorical features don’t work well
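The training and prediction steps above can be sketched in a few lines of base R (a toy implementation for illustration only, not the class::knn() used in the next lecture):

```r
# Toy KNN classifier: predict the majority label of the k nearest training points.
knn.predict <- function(train.x, train.y, x, k = 3) {
  # Euclidean distance from x to every training point
  d <- sqrt(rowSums(sweep(train.x, 2, x)^2))
  nearest <- order(d)[1:k]                  # indices of the k closest points
  names(which.max(table(train.y[nearest]))) # majority vote among their labels
}

# Made-up heights/weights for dogs and horses
train.x <- rbind(c(1, 1), c(1, 2), c(8, 8), c(9, 8))
train.y <- c("Dog", "Dog", "Horse", "Horse")
knn.predict(train.x, train.y, c(2, 1), k = 3)  # "Dog"
```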

Lecture 103: K Nearest Neighbors (KNN) with R

Get the data

Dataset: Caravan from ISLR package

# install.packages("ISLR")
library(ISLR)

summary(Caravan)
##     MOSTYPE         MAANTHUI         MGEMOMV         MGEMLEEF    
##  Min.   : 1.00   Min.   : 1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:10.00   1st Qu.: 1.000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :30.00   Median : 1.000   Median :3.000   Median :3.000  
##  Mean   :24.25   Mean   : 1.111   Mean   :2.679   Mean   :2.991  
##  3rd Qu.:35.00   3rd Qu.: 1.000   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :41.00   Max.   :10.000   Max.   :5.000   Max.   :6.000  
##     MOSHOOFD          MGODRK           MGODPR          MGODOV    
##  Min.   : 1.000   Min.   :0.0000   Min.   :0.000   Min.   :0.00  
##  1st Qu.: 3.000   1st Qu.:0.0000   1st Qu.:4.000   1st Qu.:0.00  
##  Median : 7.000   Median :0.0000   Median :5.000   Median :1.00  
##  Mean   : 5.774   Mean   :0.6965   Mean   :4.627   Mean   :1.07  
##  3rd Qu.: 8.000   3rd Qu.:1.0000   3rd Qu.:6.000   3rd Qu.:2.00  
##  Max.   :10.000   Max.   :9.0000   Max.   :9.000   Max.   :5.00  
##      MGODGE          MRELGE          MRELSA           MRELOV    
##  Min.   :0.000   Min.   :0.000   Min.   :0.0000   Min.   :0.00  
##  1st Qu.:2.000   1st Qu.:5.000   1st Qu.:0.0000   1st Qu.:1.00  
##  Median :3.000   Median :6.000   Median :1.0000   Median :2.00  
##  Mean   :3.259   Mean   :6.183   Mean   :0.8835   Mean   :2.29  
##  3rd Qu.:4.000   3rd Qu.:7.000   3rd Qu.:1.0000   3rd Qu.:3.00  
##  Max.   :9.000   Max.   :9.000   Max.   :7.0000   Max.   :9.00  
##     MFALLEEN        MFGEKIND       MFWEKIND      MOPLHOOG    
##  Min.   :0.000   Min.   :0.00   Min.   :0.0   Min.   :0.000  
##  1st Qu.:0.000   1st Qu.:2.00   1st Qu.:3.0   1st Qu.:0.000  
##  Median :2.000   Median :3.00   Median :4.0   Median :1.000  
##  Mean   :1.888   Mean   :3.23   Mean   :4.3   Mean   :1.461  
##  3rd Qu.:3.000   3rd Qu.:4.00   3rd Qu.:6.0   3rd Qu.:2.000  
##  Max.   :9.000   Max.   :9.00   Max.   :9.0   Max.   :9.000  
##     MOPLMIDD        MOPLLAAG        MBERHOOG        MBERZELF    
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:2.000   1st Qu.:3.000   1st Qu.:0.000   1st Qu.:0.000  
##  Median :3.000   Median :5.000   Median :2.000   Median :0.000  
##  Mean   :3.351   Mean   :4.572   Mean   :1.895   Mean   :0.398  
##  3rd Qu.:4.000   3rd Qu.:6.000   3rd Qu.:3.000   3rd Qu.:1.000  
##  Max.   :9.000   Max.   :9.000   Max.   :9.000   Max.   :5.000  
##     MBERBOER         MBERMIDD        MBERARBG       MBERARBO    
##  Min.   :0.0000   Min.   :0.000   Min.   :0.00   Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:1.00   1st Qu.:1.000  
##  Median :0.0000   Median :3.000   Median :2.00   Median :2.000  
##  Mean   :0.5223   Mean   :2.899   Mean   :2.22   Mean   :2.306  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:3.00   3rd Qu.:3.000  
##  Max.   :9.0000   Max.   :9.000   Max.   :9.00   Max.   :9.000  
##       MSKA           MSKB1           MSKB2            MSKC      
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:2.000  
##  Median :1.000   Median :2.000   Median :2.000   Median :4.000  
##  Mean   :1.621   Mean   :1.607   Mean   :2.203   Mean   :3.759  
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:5.000  
##  Max.   :9.000   Max.   :9.000   Max.   :9.000   Max.   :9.000  
##       MSKD           MHHUUR          MHKOOP          MAUT1     
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.00  
##  1st Qu.:0.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:5.00  
##  Median :1.000   Median :4.000   Median :5.000   Median :6.00  
##  Mean   :1.067   Mean   :4.237   Mean   :4.772   Mean   :6.04  
##  3rd Qu.:2.000   3rd Qu.:7.000   3rd Qu.:7.000   3rd Qu.:7.00  
##  Max.   :9.000   Max.   :9.000   Max.   :9.000   Max.   :9.00  
##      MAUT2           MAUT0          MZFONDS          MZPART     
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.000   1st Qu.:1.000   1st Qu.:5.000   1st Qu.:1.000  
##  Median :1.000   Median :2.000   Median :7.000   Median :2.000  
##  Mean   :1.316   Mean   :1.959   Mean   :6.277   Mean   :2.729  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:8.000   3rd Qu.:4.000  
##  Max.   :7.000   Max.   :9.000   Max.   :9.000   Max.   :9.000  
##     MINKM30         MINK3045        MINK4575        MINK7512     
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :2.000   Median :4.000   Median :3.000   Median :0.0000  
##  Mean   :2.574   Mean   :3.536   Mean   :2.731   Mean   :0.7961  
##  3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:1.0000  
##  Max.   :9.000   Max.   :9.000   Max.   :9.000   Max.   :9.0000  
##     MINK123M         MINKGEM         MKOOPKLA        PWAPART      
##  Min.   :0.0000   Min.   :0.000   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:3.000   1st Qu.:0.0000  
##  Median :0.0000   Median :4.000   Median :4.000   Median :0.0000  
##  Mean   :0.2027   Mean   :3.784   Mean   :4.236   Mean   :0.7712  
##  3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.:6.000   3rd Qu.:2.0000  
##  Max.   :9.0000   Max.   :9.000   Max.   :8.000   Max.   :3.0000  
##     PWABEDR           PWALAND           PPERSAUT       PBESAUT       
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :5.00   Median :0.00000  
##  Mean   :0.04002   Mean   :0.07162   Mean   :2.97   Mean   :0.04827  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:6.00   3rd Qu.:0.00000  
##  Max.   :6.00000   Max.   :4.00000   Max.   :8.00   Max.   :7.00000  
##     PMOTSCO          PVRAAUT            PAANHANG          PTRACTOR      
##  Min.   :0.0000   Min.   :0.000000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.000000   Median :0.00000   Median :0.00000  
##  Mean   :0.1754   Mean   :0.009447   Mean   :0.02096   Mean   :0.09258  
##  3rd Qu.:0.0000   3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :7.0000   Max.   :9.000000   Max.   :5.00000   Max.   :6.00000  
##      PWERKT            PBROM           PLEVEN          PPERSONG      
##  Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.000   Median :0.0000   Median :0.00000  
##  Mean   :0.01305   Mean   :0.215   Mean   :0.1948   Mean   :0.01374  
##  3rd Qu.:0.00000   3rd Qu.:0.000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :6.00000   Max.   :6.000   Max.   :9.0000   Max.   :6.00000  
##     PGEZONG           PWAOREG            PBRAND         PZEILPL         
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.000   Min.   :0.0000000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.000   1st Qu.:0.0000000  
##  Median :0.00000   Median :0.00000   Median :2.000   Median :0.0000000  
##  Mean   :0.01529   Mean   :0.02353   Mean   :1.828   Mean   :0.0008588  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:4.000   3rd Qu.:0.0000000  
##  Max.   :3.00000   Max.   :7.00000   Max.   :8.000   Max.   :3.0000000  
##     PPLEZIER           PFIETS           PINBOED           PBYSTAND      
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.01889   Mean   :0.02525   Mean   :0.01563   Mean   :0.04758  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :6.00000   Max.   :1.00000   Max.   :6.00000   Max.   :5.00000  
##     AWAPART         AWABEDR           AWALAND           APERSAUT     
##  Min.   :0.000   Min.   :0.00000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.000   Median :0.00000   Median :0.00000   Median :1.0000  
##  Mean   :0.403   Mean   :0.01477   Mean   :0.02061   Mean   :0.5622  
##  3rd Qu.:1.000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:1.0000  
##  Max.   :2.000   Max.   :5.00000   Max.   :1.00000   Max.   :7.0000  
##     ABESAUT           AMOTSCO           AVRAAUT            AAANHANG      
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.000000   Median :0.00000  
##  Mean   :0.01048   Mean   :0.04105   Mean   :0.002233   Mean   :0.01254  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.00000  
##  Max.   :4.00000   Max.   :8.00000   Max.   :3.000000   Max.   :3.00000  
##     ATRACTOR           AWERKT             ABROM             ALEVEN       
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.000000   Median :0.00000   Median :0.00000  
##  Mean   :0.03367   Mean   :0.006183   Mean   :0.07042   Mean   :0.07661  
##  3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :4.00000   Max.   :6.000000   Max.   :2.00000   Max.   :8.00000  
##     APERSONG           AGEZONG            AWAOREG             ABRAND      
##  Min.   :0.000000   Min.   :0.000000   Min.   :0.000000   Min.   :0.0000  
##  1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.0000  
##  Median :0.000000   Median :0.000000   Median :0.000000   Median :1.0000  
##  Mean   :0.005325   Mean   :0.006527   Mean   :0.004638   Mean   :0.5701  
##  3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:1.0000  
##  Max.   :1.000000   Max.   :1.000000   Max.   :2.000000   Max.   :7.0000  
##     AZEILPL             APLEZIER            AFIETS       
##  Min.   :0.0000000   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.0000000   1st Qu.:0.000000   1st Qu.:0.00000  
##  Median :0.0000000   Median :0.000000   Median :0.00000  
##  Mean   :0.0005153   Mean   :0.006012   Mean   :0.03178  
##  3rd Qu.:0.0000000   3rd Qu.:0.000000   3rd Qu.:0.00000  
##  Max.   :1.0000000   Max.   :2.000000   Max.   :3.00000  
##     AINBOED            ABYSTAND       Purchase  
##  Min.   :0.000000   Min.   :0.00000   No :5474  
##  1st Qu.:0.000000   1st Qu.:0.00000   Yes: 348  
##  Median :0.000000   Median :0.00000             
##  Mean   :0.007901   Mean   :0.01426             
##  3rd Qu.:0.000000   3rd Qu.:0.00000             
##  Max.   :2.000000   Max.   :2.00000
any(is.na(Caravan))
## [1] FALSE
library(Amelia)
missmap(Caravan)

head(Caravan)
##   MOSTYPE MAANTHUI MGEMOMV MGEMLEEF MOSHOOFD MGODRK MGODPR MGODOV MGODGE
## 1      33        1       3        2        8      0      5      1      3
## 2      37        1       2        2        8      1      4      1      4
## 3      37        1       2        2        8      0      4      2      4
## 4       9        1       3        3        3      2      3      2      4
## 5      40        1       4        2       10      1      4      1      4
## 6      23        1       2        1        5      0      5      0      5
##   MRELGE MRELSA MRELOV MFALLEEN MFGEKIND MFWEKIND MOPLHOOG MOPLMIDD
## 1      7      0      2        1        2        6        1        2
## 2      6      2      2        0        4        5        0        5
## 3      3      2      4        4        4        2        0        5
## 4      5      2      2        2        3        4        3        4
## 5      7      1      2        2        4        4        5        4
## 6      0      6      3        3        5        2        0        5
##   MOPLLAAG MBERHOOG MBERZELF MBERBOER MBERMIDD MBERARBG MBERARBO MSKA
## 1        7        1        0        1        2        5        2    1
## 2        4        0        0        0        5        0        4    0
## 3        4        0        0        0        7        0        2    0
## 4        2        4        0        0        3        1        2    3
## 5        0        0        5        4        0        0        0    9
## 6        4        2        0        0        4        2        2    2
##   MSKB1 MSKB2 MSKC MSKD MHHUUR MHKOOP MAUT1 MAUT2 MAUT0 MZFONDS MZPART
## 1     1     2    6    1      1      8     8     0     1       8      1
## 2     2     3    5    0      2      7     7     1     2       6      3
## 3     5     0    4    0      7      2     7     0     2       9      0
## 4     2     1    4    0      5      4     9     0     0       7      2
## 5     0     0    0    0      4      5     6     2     1       5      4
## 6     2     2    4    2      9      0     5     3     3       9      0
##   MINKM30 MINK3045 MINK4575 MINK7512 MINK123M MINKGEM MKOOPKLA PWAPART
## 1       0        4        5        0        0       4        3       0
## 2       2        0        5        2        0       5        4       2
## 3       4        5        0        0        0       3        4       2
## 4       1        5        3        0        0       4        4       0
## 5       0        0        9        0        0       6        3       0
## 6       5        2        3        0        0       3        3       0
##   PWABEDR PWALAND PPERSAUT PBESAUT PMOTSCO PVRAAUT PAANHANG PTRACTOR
## 1       0       0        6       0       0       0        0        0
## 2       0       0        0       0       0       0        0        0
## 3       0       0        6       0       0       0        0        0
## 4       0       0        6       0       0       0        0        0
## 5       0       0        0       0       0       0        0        0
## 6       0       0        6       0       0       0        0        0
##   PWERKT PBROM PLEVEN PPERSONG PGEZONG PWAOREG PBRAND PZEILPL PPLEZIER
## 1      0     0      0        0       0       0      5       0        0
## 2      0     0      0        0       0       0      2       0        0
## 3      0     0      0        0       0       0      2       0        0
## 4      0     0      0        0       0       0      2       0        0
## 5      0     0      0        0       0       0      6       0        0
## 6      0     0      0        0       0       0      0       0        0
##   PFIETS PINBOED PBYSTAND AWAPART AWABEDR AWALAND APERSAUT ABESAUT AMOTSCO
## 1      0       0        0       0       0       0        1       0       0
## 2      0       0        0       2       0       0        0       0       0
## 3      0       0        0       1       0       0        1       0       0
## 4      0       0        0       0       0       0        1       0       0
## 5      0       0        0       0       0       0        0       0       0
## 6      0       0        0       0       0       0        1       0       0
##   AVRAAUT AAANHANG ATRACTOR AWERKT ABROM ALEVEN APERSONG AGEZONG AWAOREG
## 1       0        0        0      0     0      0        0       0       0
## 2       0        0        0      0     0      0        0       0       0
## 3       0        0        0      0     0      0        0       0       0
## 4       0        0        0      0     0      0        0       0       0
## 5       0        0        0      0     0      0        0       0       0
## 6       0        0        0      0     0      0        0       0       0
##   ABRAND AZEILPL APLEZIER AFIETS AINBOED ABYSTAND Purchase
## 1      1       0        0      0       0        0       No
## 2      1       0        0      0       0        0       No
## 3      1       0        0      0       0        0       No
## 4      1       0        0      0       0        0       No
## 5      1       0        0      0       0        0       No
## 6      0       0        0      0       0        0       No

Next, we look at the variance of individual features. Because KNN predicts from distances, features on a large scale dominate features on a small scale, so we separate the labels and standardize the predictors with scale().

var(Caravan[,1])
## [1] 165.0378
var(Caravan[,2])
## [1] 0.1647078
purchase <- Caravan[,86]
standardized.Caravan <- scale(Caravan[,-86])
print(var(standardized.Caravan[,1]))
## [1] 1
print(var(standardized.Caravan[,2]))
## [1] 1
# Train test split
## Test set 
test.index <- 1:1000
test.data <- standardized.Caravan[test.index,]
test.purchase <- purchase[test.index]
# Train set
train.data <- standardized.Caravan[-test.index,]
train.purchase <- purchase[-test.index]

Based on the above training/test set, we fit a KNN model.

library(class)

## KNN model
set.seed(101)
predicted.purchase <- knn(train.data,test.data,train.purchase,k=1)
head(predicted.purchase)
## [1] No No No No No No
## Levels: No Yes
# misclass
mean(test.purchase != predicted.purchase)
## [1] 0.116
misclass.error <- mean(test.purchase!=predicted.purchase)
print(misclass.error)
## [1] 0.116
# k=3 case misclassification rate = 0.073
predicted.purchase <- knn(train.data,test.data,train.purchase,k=3)
mean(test.purchase != predicted.purchase)
## [1] 0.073
# k=5 case
set.seed(101)
predicted.purchase <- knn(train.data,test.data,train.purchase,k=5)
mean(test.purchase != predicted.purchase)
## [1] 0.066
#### Choosing a K value
predicted.purchase = NULL
error.rate = NULL

for(i in 1:20){
    set.seed(101)
    predicted.purchase = knn(train.data,test.data,train.purchase,k=i)
    error.rate[i] = mean(test.purchase != predicted.purchase)
}
print(error.rate)
##  [1] 0.116 0.107 0.074 0.070 0.066 0.064 0.062 0.061 0.058 0.058 0.059
## [12] 0.058 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059
# visualize K elbow method
library(ggplot2)
k.values <- 1:20
error.df <- data.frame(error.rate,k.values)
ggplot(error.df,aes(k.values,error.rate))+geom_point()+geom_line(lty='dotted',color='red')

Lecture 104: K Nearest Neighbors Project

Since KNN is such a simple algorithm, we will use this project as a simple exercise to test the implementation of KNN. For this project, just follow along with the bolded instructions. It should be very simple; at the end, we will have an additional optional bonus project.

Get the data

We will use the famous iris data set for this project. It’s a small data set with flower features that can be used to attempt to predict the species of an iris flower.

First, we load the ISLR library (although the iris data set actually ships with base R’s datasets package, so this isn’t strictly required) and check the head and structure of the data frame.

library(ISLR)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Standardize Data

In this case, the iris data set has all its features on the same order of magnitude, but it’s good practice (especially with KNN) to standardize the features in the data. Let’s go ahead and do this even though it’s not necessary for this data.

Use scale() to standardize the feature columns of the iris data set, and assign this standardized version of the data to a new variable.

var(iris[,1])
## [1] 0.6856935
var(iris[,2])
## [1] 0.1899794
var(iris[,3])
## [1] 3.116278
var(iris[,4])
## [1] 0.5810063
species <- iris[,5]
standardized.iris <- scale(iris[,-5])

var(standardized.iris[,1])
## [1] 1
var(standardized.iris[,2])
## [1] 1
var(standardized.iris[,3])
## [1] 1
var(standardized.iris[,4])
## [1] 1
head(standardized.iris)
##      Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,]   -0.8976739  1.01560199    -1.335752   -1.311052
## [2,]   -1.1392005 -0.13153881    -1.335752   -1.311052
## [3,]   -1.3807271  0.32731751    -1.392399   -1.311052
## [4,]   -1.5014904  0.09788935    -1.279104   -1.311052
## [5,]   -1.0184372  1.24503015    -1.335752   -1.311052
## [6,]   -0.5353840  1.93331463    -1.165809   -1.048667
dim(standardized.iris)
## [1] 150   4

Train and Test splits

Then we split the standardized data into train and test sets. The first block below does a manual 105/45 index split; the second uses sample.split() from the caTools library, which keeps the proportion of each Species the same in both sets.

library(magrittr)
iris.train <-  sample(150,105)
iris.test <- (-iris.train)
iris.train.df <- standardized.iris[iris.train,] %>% as.data.frame()
iris.test.df <- standardized.iris[iris.test,] %>% as.data.frame()
head(iris.test.df)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1  -1.01843718  1.24503015    -1.335752   -1.311052
## 2  -1.13920048  0.09788935    -1.279104   -1.442245
## 3  -1.25996379  0.78617383    -1.222456   -1.311052
## 4  -0.05233076  2.16274279    -1.449047   -1.311052
## 5  -0.17309407  3.08045544    -1.279104   -1.048667
## 6  -0.89767388  1.01560199    -1.335752   -1.179859
library(caTools)
sample <- sample.split(iris$Species,SplitRatio=0.7)
train <- subset(iris,sample==TRUE)
test <- subset(iris,sample==FALSE)  # '==' is required; the original run used 'sample=FALSE', which R ignores, so the test set below is the full 150 rows

Build a KNN Model

We load the class library and use the knn() function to predict the species of the test set, with k=1.

# KNN: class package
library(class)

# knn model
set.seed(101)
predicted.species <- knn(train[1:4],test[1:4],train$Species,k=1)
predicted.species
##   [1] setosa     setosa     setosa     setosa     setosa     setosa    
##   [7] setosa     setosa     setosa     setosa     setosa     setosa    
##  [13] setosa     setosa     setosa     setosa     setosa     setosa    
##  [19] setosa     setosa     setosa     setosa     setosa     setosa    
##  [25] setosa     setosa     setosa     setosa     setosa     setosa    
##  [31] setosa     setosa     setosa     setosa     setosa     setosa    
##  [37] setosa     setosa     setosa     setosa     setosa     setosa    
##  [43] setosa     setosa     setosa     setosa     setosa     setosa    
##  [49] setosa     setosa     versicolor versicolor versicolor versicolor
##  [55] versicolor versicolor versicolor versicolor versicolor versicolor
##  [61] versicolor versicolor versicolor versicolor versicolor versicolor
##  [67] versicolor versicolor versicolor versicolor virginica  versicolor
##  [73] virginica  versicolor versicolor versicolor versicolor versicolor
##  [79] versicolor versicolor versicolor versicolor versicolor versicolor
##  [85] versicolor versicolor versicolor versicolor versicolor versicolor
##  [91] versicolor versicolor versicolor versicolor versicolor versicolor
##  [97] versicolor versicolor versicolor versicolor virginica  virginica 
## [103] virginica  virginica  virginica  virginica  versicolor virginica 
## [109] virginica  virginica  virginica  virginica  virginica  virginica 
## [115] virginica  virginica  virginica  virginica  virginica  virginica 
## [121] virginica  virginica  virginica  virginica  virginica  virginica 
## [127] virginica  virginica  virginica  virginica  virginica  virginica 
## [133] virginica  versicolor virginica  virginica  virginica  virginica 
## [139] virginica  virginica  virginica  virginica  virginica  virginica 
## [145] virginica  virginica  virginica  virginica  virginica  virginica 
## Levels: setosa versicolor virginica
# misclassification rate
mean(test$Species!=predicted.species)
## [1] 0.02666667

Choosing a K Value

The data set is quite small, but we can still loop over candidate k values and compare their error rates.

predicted.species <- NULL
error.rate <- NULL

for(i in 1:10){
  set.seed(101)
  predicted.species = knn(train[,1:4],test[,1:4],train$Species,k=i)
  error.rate[i]=mean(test$Species!= predicted.species)
}
print(error.rate)
##  [1] 0.02666667 0.03333333 0.03333333 0.03333333 0.02666667 0.02666667
##  [7] 0.01333333 0.02000000 0.02000000 0.02666667
# visualize K elbow method
library(ggplot2)
k.values <- 1:10
error.df <- data.frame(error.rate,k.values)
ggplot(error.df,aes(k.values,error.rate))+geom_point()+geom_line(lty='dotted',color='red')

We notice that the error rate reaches its minimum at k = 7 in this run (the course solution reports the minimum between k = 2 and 6; results differ because the train/test split is random). It then begins to creep back up, which is due to how small the data set is: at k = 10 we are already using roughly 10% of the training data for every prediction, which is quite large.

http://archive.ics.uci.edu/ml/index.php

Lecture 106: Tree Methods

Reference: ISLR Chapter 8

Decision tree:

- nodes: split on the value of a certain attribute
- root: the node that performs the first split
- leaves: terminal nodes that predict the outcome

Entropy and information gain are the mathematical methods of choosing the best split. Entropy: \[H(S) = - \sum_i p_i(S) \log_2 p_i(S)\] Information gain: \[IG(S,A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} H(S_v)\]
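As a quick sanity check on the entropy formula, it can be computed directly for a vector of class labels (a small helper written for illustration; it is not part of any tree package):

```r
# Shannon entropy of a label vector, in bits
entropy <- function(labels) {
  p <- table(labels) / length(labels)  # empirical class proportions p_i
  -sum(p * log2(p))
}

entropy(c("yes", "yes", "no", "no"))  # 1: a 50/50 node is maximally impure
entropy(rep("yes", 4))                # 0: a pure node
```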

To improve performance, we can use many trees, each with a random sample of m features chosen as split candidates (a random forest):

- a new random sample of m features is chosen for every single tree at every single split
- for classification, m is typically chosen to be the square root of p, the total number of features

Suppose there is one very strong feature in the data set. When using “bagged” trees, most of the trees will use that feature as the top split, resulting in an ensemble of similar trees that are highly correlated.

- Averaging highly correlated quantities does not significantly reduce variance.
- By randomly leaving out candidate features at each split, random forests decorrelate the trees, so that the averaging process can reduce the variance of the resulting model.

Lecture 107: Decision Trees and Random Forests

Decision trees: rpart package

library(rpart)
str(kyphosis)
## 'data.frame':    81 obs. of  4 variables:
##  $ Kyphosis: Factor w/ 2 levels "absent","present": 1 1 2 1 1 1 1 1 1 2 ...
##  $ Age     : int  71 158 128 2 1 1 61 37 113 59 ...
##  $ Number  : int  3 3 4 5 4 2 2 3 2 6 ...
##  $ Start   : int  5 14 5 1 15 16 17 16 16 12 ...
head(kyphosis)
##   Kyphosis Age Number Start
## 1   absent  71      3     5
## 2   absent 158      3    14
## 3  present 128      4     5
## 4   absent   2      5     1
## 5   absent   1      4    15
## 6   absent   1      2    16
# model application
tree <- rpart(Kyphosis~.,data=kyphosis,method='class')
printcp(tree)
## 
## Classification tree:
## rpart(formula = Kyphosis ~ ., data = kyphosis, method = "class")
## 
## Variables actually used in tree construction:
## [1] Age   Start
## 
## Root node error: 17/81 = 0.20988
## 
## n= 81 
## 
##         CP nsplit rel error xerror    xstd
## 1 0.176471      0   1.00000 1.0000 0.21559
## 2 0.019608      1   0.82353 1.0588 0.22010
## 3 0.010000      4   0.76471 1.0588 0.22010
library(magrittr)
plot(tree,uniform=T, main='Kyphosis Tree')
text(tree,use.n=T,all=T)

There are multiple functions available for the decision tree model:

- printcp: display the cp table
- plotcp: plot cross-validation results
- rsq.rpart: plot approximate R-squared and relative error for different splits (labels are only appropriate for the anova method)
- print: print results
- summary: detailed results including surrogate splits
- plot: plot the decision tree
- text: label the decision tree plot
- post: create a postscript plot of the decision tree

The above plot can be improved with the “rpart.plot” package.

library(rpart.plot)
prp(tree)

Random forests improve predictive accuracy by generating a large number of bootstrapped trees (based on random samples of variables), classifying each case using every tree in the new “forest”, and deciding a final predicted outcome by combining the results across all of the trees (a majority vote in classification).

library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
rf.model <- randomForest(Kyphosis~.,data=kyphosis)
print(rf.model)
## 
## Call:
##  randomForest(formula = Kyphosis ~ ., data = kyphosis) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##         OOB estimate of  error rate: 18.52%
## Confusion matrix:
##         absent present class.error
## absent      61       3   0.0468750
## present     12       5   0.7058824
rf.model$predicted
##       1       2       3       4       5       6       7       8       9 
##  absent  absent present  absent  absent  absent  absent  absent  absent 
##      10      11      12      13      14      15      16      17      18 
##  absent  absent  absent  absent  absent  absent  absent  absent  absent 
##      19      20      21      22      23      24      25      26      27 
##  absent  absent  absent present  absent present  absent  absent  absent 
##      28      29      30      31      32      33      34      35      36 
##  absent  absent  absent  absent  absent  absent  absent  absent  absent 
##      37      38      39      40      41      42      43      44      45 
##  absent  absent  absent  absent  absent  absent present  absent  absent 
##      46      47      48      49      50      51      52      53      54 
##  absent  absent  absent  absent  absent  absent  absent  absent  absent 
##      55      56      57      58      59      60      61      62      63 
##  absent  absent  absent present present  absent  absent present  absent 
##      64      65      66      67      68      69      70      71      72 
##  absent  absent  absent  absent  absent  absent  absent  absent  absent 
##      73      74      75      76      77      78      79      80      81 
##  absent  absent  absent  absent  absent  absent  absent present  absent 
## Levels: absent present
rf.model$ntree
## [1] 500
rf.model$confusion
##         absent present class.error
## absent      61       3   0.0468750
## present     12       5   0.7058824

Lecture 108: Decision Trees and Random Forests Project

For this project, we will be exploring the use of tree methods to classify schools as Private or Public based off their features. Let’s start by getting the data which is included in the ISLR library, the College data frame.

A data frame with 777 observations on the following 18 variables.

- Private: A factor with levels No and Yes indicating private or public university
- Apps: Number of applications received
- Accept: Number of applications accepted
- Enroll: Number of new students enrolled
- Top10perc: Pct. new students from top 10% of H.S. class
- Top25perc: Pct. new students from top 25% of H.S. class
- F.Undergrad: Number of fulltime undergraduates
- P.Undergrad: Number of parttime undergraduates
- Outstate: Out-of-state tuition
- Room.Board: Room and board costs
- Books: Estimated book costs
- Personal: Estimated personal spending
- PhD: Pct. of faculty with Ph.D.’s
- Terminal: Pct. of faculty with terminal degree
- S.F.Ratio: Student/faculty ratio
- perc.alumni: Pct. alumni who donate
- Expend: Instructional expenditure per student
- Grad.Rate: Graduation rate

Get the Data

We first call the ISLR library and check the head of College (a built-in data frame with ISLR, use data() to check this). Then reassign College to a dataframe called df.

library(ISLR)
library(magrittr)
df <- College
head(df,10)
##                              Private Apps Accept Enroll Top10perc
## Abilene Christian University     Yes 1660   1232    721        23
## Adelphi University               Yes 2186   1924    512        16
## Adrian College                   Yes 1428   1097    336        22
## Agnes Scott College              Yes  417    349    137        60
## Alaska Pacific University        Yes  193    146     55        16
## Albertson College                Yes  587    479    158        38
## Albertus Magnus College          Yes  353    340    103        17
## Albion College                   Yes 1899   1720    489        37
## Albright College                 Yes 1038    839    227        30
## Alderson-Broaddus College        Yes  582    498    172        21
##                              Top25perc F.Undergrad P.Undergrad Outstate
## Abilene Christian University        52        2885         537     7440
## Adelphi University                  29        2683        1227    12280
## Adrian College                      50        1036          99    11250
## Agnes Scott College                 89         510          63    12960
## Alaska Pacific University           44         249         869     7560
## Albertson College                   62         678          41    13500
## Albertus Magnus College             45         416         230    13290
## Albion College                      68        1594          32    13868
## Albright College                    63         973         306    15595
## Alderson-Broaddus College           44         799          78    10468
##                              Room.Board Books Personal PhD Terminal
## Abilene Christian University       3300   450     2200  70       78
## Adelphi University                 6450   750     1500  29       30
## Adrian College                     3750   400     1165  53       66
## Agnes Scott College                5450   450      875  92       97
## Alaska Pacific University          4120   800     1500  76       72
## Albertson College                  3335   500      675  67       73
## Albertus Magnus College            5720   500     1500  90       93
## Albion College                     4826   450      850  89      100
## Albright College                   4400   300      500  79       84
## Alderson-Broaddus College          3380   660     1800  40       41
##                              S.F.Ratio perc.alumni Expend Grad.Rate
## Abilene Christian University      18.1          12   7041        60
## Adelphi University                12.2          16  10527        56
## Adrian College                    12.9          30   8735        54
## Agnes Scott College                7.7          37  19016        59
## Alaska Pacific University         11.9           2  10922        15
## Albertson College                  9.4          11   9727        55
## Albertus Magnus College           11.5          26   8861        63
## Albion College                    13.7          37  11487        73
## Albright College                  11.3          23  11644        80
## Alderson-Broaddus College         11.5          15   8991        52

Exploratory Data Analysis (EDA)

We create a scatterplot of Grad.Rate versus Room.Board, colored by the Private column.

library(ggplot2)
ggplot(df,aes(x=Room.Board,y=Grad.Rate))+geom_point(aes(color=Private))

Then, we also create a histogram of full time undergrad students, color by Private.

ggplot(df,aes(x=F.Undergrad))+geom_histogram(aes(fill=Private),color='light grey')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Next, create a histogram of Grad.Rate colored by Private.

ggplot(df,aes(x=Grad.Rate))+geom_histogram(aes(fill=Private),color="light gray")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(df,aes(x=Grad.Rate))+geom_histogram(aes(fill=Private),color="light gray",bins=50)

What college had a Graduation rate of above 100%?

library(magrittr)
head(df)
##                              Private Apps Accept Enroll Top10perc
## Abilene Christian University     Yes 1660   1232    721        23
## Adelphi University               Yes 2186   1924    512        16
## Adrian College                   Yes 1428   1097    336        22
## Agnes Scott College              Yes  417    349    137        60
## Alaska Pacific University        Yes  193    146     55        16
## Albertson College                Yes  587    479    158        38
##                              Top25perc F.Undergrad P.Undergrad Outstate
## Abilene Christian University        52        2885         537     7440
## Adelphi University                  29        2683        1227    12280
## Adrian College                      50        1036          99    11250
## Agnes Scott College                 89         510          63    12960
## Alaska Pacific University           44         249         869     7560
## Albertson College                   62         678          41    13500
##                              Room.Board Books Personal PhD Terminal
## Abilene Christian University       3300   450     2200  70       78
## Adelphi University                 6450   750     1500  29       30
## Adrian College                     3750   400     1165  53       66
## Agnes Scott College                5450   450      875  92       97
## Alaska Pacific University          4120   800     1500  76       72
## Albertson College                  3335   500      675  67       73
##                              S.F.Ratio perc.alumni Expend Grad.Rate
## Abilene Christian University      18.1          12   7041        60
## Adelphi University                12.2          16  10527        56
## Adrian College                    12.9          30   8735        54
## Agnes Scott College                7.7          37  19016        59
## Alaska Pacific University         11.9           2  10922        15
## Albertson College                  9.4          11   9727        55
# subset() function
subset(df,Grad.Rate>100)
##                   Private Apps Accept Enroll Top10perc Top25perc
## Cazenovia College     Yes 3847   3433    527         9        35
##                   F.Undergrad P.Undergrad Outstate Room.Board Books
## Cazenovia College        1010          12     9384       4840   600
##                   Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Cazenovia College      500  22       47      14.3          20   7697
##                   Grad.Rate
## Cazenovia College       118

As we want to change the college’s grad rate to 100%, we simply re-define the data frame.

df['Cazenovia College','Grad.Rate'] <- 100

Train Test Split

Next, we split the data into training and testing sets (70/30), using the caTools library.

library(caTools)
set.seed(101)

sample = sample.split(df$Private,SplitRatio=.70)
train=subset(df,sample==TRUE)
test=subset(df,sample==FALSE)

Decision Tree

We use the rpart library to build a decision tree to predict whether or not a school is Private. Remember to build the tree only off the training data.

library(rpart)
tree <- rpart(Private~.,method='class',data=train)

We use predict() to predict the Private label on the test data, and check the head of the predicted values. Notice that we have two columns with the class probabilities.

tree.preds <- predict(tree,test)
head(tree.preds)
##                                                  No       Yes
## Adrian College                          0.003311258 0.9966887
## Alfred University                       0.003311258 0.9966887
## Allegheny College                       0.003311258 0.9966887
## Allentown Coll. of St. Francis de Sales 0.003311258 0.9966887
## Alma College                            0.003311258 0.9966887
## Amherst College                         0.003311258 0.9966887

We can turn these columns into one column to match the original Yes/No labels of the Private column.

tree.preds <- as.data.frame(tree.preds)

joiner <- function(x){
  if (x>=0.5){
    return('Yes')
  }else {
    return('No')
  }
}

tree.preds$Private <- sapply(tree.preds$Yes,joiner)
head(tree.preds)
##                                                  No       Yes Private
## Adrian College                          0.003311258 0.9966887     Yes
## Alfred University                       0.003311258 0.9966887     Yes
## Allegheny College                       0.003311258 0.9966887     Yes
## Allentown Coll. of St. Francis de Sales 0.003311258 0.9966887     Yes
## Alma College                            0.003311258 0.9966887     Yes
## Amherst College                         0.003311258 0.9966887     Yes
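As an aside, the joiner/sapply step can also be written with the vectorized ifelse(), which avoids defining a helper function (shown here on a toy data frame standing in for the actual predictions):

```r
# Toy stand-in for tree.preds: two class-probability columns
preds <- data.frame(No = c(0.9, 0.2), Yes = c(0.1, 0.8))
# Vectorized alternative to sapply(preds$Yes, joiner)
preds$Private <- ifelse(preds$Yes >= 0.5, 'Yes', 'No')
print(preds$Private)  # "No" "Yes"
```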

Based on the above, we use table() to create a confusion matrix for the tree model. We can plot the tree with the prp() function.

# confusion matrix
table(tree.preds$Private,test$Private)
##      
##        No Yes
##   No   57   9
##   Yes   7 160
# plot by prp()
library(rpart.plot)
prp(tree)

Random Forest

Let’s build out a random forest model. First, we call the randomForest package library, and use randomForest() to build out a model to predict Private class. Add importance=TRUE as a parameter in the model to find out what this does.

library(randomForest)
rf.model <- randomForest(Private~.,data=train,importance=T)

The model’s confusion matrix on its own training set is,

rf.model$confusion
##      No Yes class.error
## No  125  23  0.15540541
## Yes  10 386  0.02525253

To grab the feature importance, use model$importance; refer to the reading for more information on what Gini means.

rf.model$importance
##                      No          Yes MeanDecreaseAccuracy MeanDecreaseGini
## Apps        0.030638008 0.0138490702         0.0183834770         8.191846
## Accept      0.031590637 0.0157698867         0.0199998841        13.280573
## Enroll      0.036691206 0.0284928559         0.0307724157        20.832978
## Top10perc   0.008783509 0.0041456953         0.0053787323         5.250942
## Top25perc   0.007412220 0.0044694518         0.0052145181         4.501693
## F.Undergrad 0.154011049 0.0748721489         0.0962450156        40.463681
## P.Undergrad 0.048475334 0.0073411365         0.0184278295        15.656301
## Outstate    0.141095168 0.0613496370         0.0830099419        42.602561
## Room.Board  0.021518748 0.0144373926         0.0163748212        11.415365
## Books       0.001852497 0.0002509989         0.0007121824         2.248328
## Personal    0.003032677 0.0016272012         0.0019899372         3.288314
## PhD         0.006891102 0.0048891976         0.0054043435         4.401988
## Terminal    0.005552642 0.0041588988         0.0045469137         4.373904
## S.F.Ratio   0.031130358 0.0096029611         0.0153991012        15.809862
## perc.alumni 0.021092075 0.0041202501         0.0087216547         5.065061
## Expend      0.025496668 0.0134694646         0.0166589911        10.504956
## Grad.Rate   0.017449980 0.0055589660         0.0088252942         7.442815

Predictions

We use the random forest model to predict on the test set.

p <- predict(rf.model,test)
table(p,test$Private)
##      
## p      No Yes
##   No   58   7
##   Yes   6 162

It should have performed better than just a single tree, but how much better depends on whether you are measuring recall, precision, or accuracy as the most important measure of the model.
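To make that concrete, here is a sketch of deriving accuracy, precision, and recall from the 2x2 confusion matrix above (rows are predictions, columns are actual labels; “Yes” is treated as the positive class):

```r
# Counts copied from the random forest confusion matrix above
cm <- matrix(c(58, 6, 7, 162), nrow = 2,
             dimnames = list(predicted = c('No', 'Yes'),
                             actual    = c('No', 'Yes')))
accuracy  <- sum(diag(cm)) / sum(cm)              # (58 + 162) / 233
precision <- cm['Yes', 'Yes'] / sum(cm['Yes', ])  # TP / (TP + FP)
recall    <- cm['Yes', 'Yes'] / sum(cm[, 'Yes'])  # TP / (TP + FN)
round(c(accuracy = accuracy, precision = precision, recall = recall), 3)
```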

Section 28: Support Vector Machines

Lecture 111: Introduction to Support Vector Machines

Reference: ISLR Chapter 9

Support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis.

Given a set of training examples, each marked for belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples into one category or the other, making it a non-probabilistic binary linear classifier.

An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible.

New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.

Lecture 112: Support Vector Machine with R

svm model application

# iris dataset
library(ISLR)
print(head(iris))
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
library(e1071)
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:coefplot':
## 
##     extractPath
help('svm')
## starting httpd help server ...
##  done
# svm model application
model <- svm(Species ~ .,data=iris)
summary(model)
## 
## Call:
## svm(formula = Species ~ ., data = iris)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.25 
## 
## Number of Support Vectors:  51
## 
##  ( 8 22 21 )
## 
## 
## Number of Classes:  3 
## 
## Levels: 
##  setosa versicolor virginica

model prediction

pred.values <- predict(model,iris[1:4])
table(pred.values,iris[,5])
##             
## pred.values  setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         48         2
##   virginica       0          2        48

Parameters:

- cost: allows the support vector machine to have what’s known as a soft margin; it is the parameter of the soft-margin cost function, which controls the influence of each individual support vector.
- gamma: the free parameter of the Gaussian radial basis function (a small gamma implies that each support vector has a broad, far-reaching influence).

# range:list of list and gamma
tune.results <- tune(svm,train.x=iris[1:4],train.y=iris[,5],kernel='radial',ranges=list(cost=c(0.1,1,10),gamma=c(0.5,1,2)))
summary(tune.results)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost gamma
##     1   0.5
## 
## - best performance: 0.04 
## 
## - Detailed performance results:
##   cost gamma      error dispersion
## 1  0.1   0.5 0.06000000 0.05837300
## 2  1.0   0.5 0.04000000 0.04661373
## 3 10.0   0.5 0.05333333 0.06126244
## 4  0.1   1.0 0.05333333 0.05258738
## 5  1.0   1.0 0.04666667 0.03220306
## 6 10.0   1.0 0.06000000 0.04919099
## 7  0.1   2.0 0.08000000 0.05258738
## 8  1.0   2.0 0.05333333 0.04216370
## 9 10.0   2.0 0.06000000 0.05837300
tune.results <- tune(svm,train.x=iris[1:4],train.y=iris[,5],kernel='radial',ranges=list(cost=c(0.5,1,1.5),gamma=c(0.1,0.5,0.7)))
summary(tune.results)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost gamma
##     1   0.1
## 
## - best performance: 0.02666667 
## 
## - Detailed performance results:
##   cost gamma      error dispersion
## 1  0.5   0.1 0.04000000 0.06440612
## 2  1.0   0.1 0.02666667 0.03442652
## 3  1.5   0.1 0.03333333 0.03513642
## 4  0.5   0.5 0.04666667 0.04499657
## 5  1.0   0.5 0.04000000 0.04661373
## 6  1.5   0.5 0.04666667 0.04499657
## 7  0.5   0.7 0.05333333 0.04216370
## 8  1.0   0.7 0.04666667 0.04499657
## 9  1.5   0.7 0.04666667 0.04499657

Based on the above results, we use cost=1.5 and gamma=0.1.

tuned.svm <- svm(Species~.,data=iris,kernel='radial',cost=1.5,gamma=0.1)
summary(tuned.svm)
## 
## Call:
## svm(formula = Species ~ ., data = iris, kernel = "radial", cost = 1.5, 
##     gamma = 0.1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1.5 
##       gamma:  0.1 
## 
## Number of Support Vectors:  50
## 
##  ( 4 23 23 )
## 
## 
## Number of Classes:  3 
## 
## Levels: 
##  setosa versicolor virginica

Lecture 114: Support Vector Machines Project

For this project, we will be exploring publicly available data from LendingClub.com. Lending Club connects people who need money (borrowers) with people who have money (investors). Hopefully, as an investor you would want to invest in people who showed a profile of having a high probability of paying you back. We will try to create a model that will help predict this.

Lending club had a very interesting year in 2016, so let’s check out some of their data and keep the context in mind. This data is from before they even went public.

We will use lending data from 2007-2010 and be trying to classify and predict whether or not the borrower paid back their loan in full. You can download the data from here or just use the csv already provided. It’s recommended you use the csv provided as it has been cleaned of NA values.

Here are what the columns represent:

- credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
- purpose: The purpose of the loan (takes values “credit_card”, “debt_consolidation”, “educational”, “major_purchase”, “small_business”, and “all_other”).
- int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
- installment: The monthly installments owed by the borrower if the loan is funded.
- log.annual.inc: The natural log of the self-reported annual income of the borrower.
- dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
- fico: The FICO credit score of the borrower.
- days.with.cr.line: The number of days the borrower has had a credit line.
- revol.bal: The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle).
- revol.util: The borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available).
- inq.last.6mths: The borrower’s number of inquiries by creditors in the last 6 months.
- delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
- pub.rec: The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments).

Data

We read the loan_data.csv file and save it as a dataframe called loans.

loans <- read.csv('loan_data.csv')

We briefly check the summary and structure of loans.

str(loans)
## 'data.frame':    9578 obs. of  14 variables:
##  $ credit.policy    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ purpose          : Factor w/ 7 levels "all_other","credit_card",..: 3 2 3 3 2 2 3 1 5 3 ...
##  $ int.rate         : num  0.119 0.107 0.136 0.101 0.143 ...
##  $ installment      : num  829 228 367 162 103 ...
##  $ log.annual.inc   : num  11.4 11.1 10.4 11.4 11.3 ...
##  $ dti              : num  19.5 14.3 11.6 8.1 15 ...
##  $ fico             : int  737 707 682 712 667 727 667 722 682 707 ...
##  $ days.with.cr.line: num  5640 2760 4710 2700 4066 ...
##  $ revol.bal        : int  28854 33623 3511 33667 4740 50807 3839 24220 69909 5630 ...
##  $ revol.util       : num  52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
##  $ inq.last.6mths   : int  0 0 1 1 0 0 0 0 1 1 ...
##  $ delinq.2yrs      : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ pub.rec          : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ not.fully.paid   : int  0 0 0 0 0 0 1 1 0 0 ...
summary(loans)
##  credit.policy                 purpose        int.rate     
##  Min.   :0.000   all_other         :2331   Min.   :0.0600  
##  1st Qu.:1.000   credit_card       :1262   1st Qu.:0.1039  
##  Median :1.000   debt_consolidation:3957   Median :0.1221  
##  Mean   :0.805   educational       : 343   Mean   :0.1226  
##  3rd Qu.:1.000   home_improvement  : 629   3rd Qu.:0.1407  
##  Max.   :1.000   major_purchase    : 437   Max.   :0.2164  
##                  small_business    : 619                   
##   installment     log.annual.inc        dti              fico      
##  Min.   : 15.67   Min.   : 7.548   Min.   : 0.000   Min.   :612.0  
##  1st Qu.:163.77   1st Qu.:10.558   1st Qu.: 7.213   1st Qu.:682.0  
##  Median :268.95   Median :10.929   Median :12.665   Median :707.0  
##  Mean   :319.09   Mean   :10.932   Mean   :12.607   Mean   :710.8  
##  3rd Qu.:432.76   3rd Qu.:11.291   3rd Qu.:17.950   3rd Qu.:737.0  
##  Max.   :940.14   Max.   :14.528   Max.   :29.960   Max.   :827.0  
##                                                                    
##  days.with.cr.line   revol.bal         revol.util    inq.last.6mths  
##  Min.   :  179     Min.   :      0   Min.   :  0.0   Min.   : 0.000  
##  1st Qu.: 2820     1st Qu.:   3187   1st Qu.: 22.6   1st Qu.: 0.000  
##  Median : 4140     Median :   8596   Median : 46.3   Median : 1.000  
##  Mean   : 4561     Mean   :  16914   Mean   : 46.8   Mean   : 1.577  
##  3rd Qu.: 5730     3rd Qu.:  18250   3rd Qu.: 70.9   3rd Qu.: 2.000  
##  Max.   :17640     Max.   :1207359   Max.   :119.0   Max.   :33.000  
##                                                                      
##   delinq.2yrs         pub.rec        not.fully.paid  
##  Min.   : 0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.: 0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median : 0.0000   Median :0.00000   Median :0.0000  
##  Mean   : 0.1637   Mean   :0.06212   Mean   :0.1601  
##  3rd Qu.: 0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :13.0000   Max.   :5.00000   Max.   :1.0000  
## 

Next, we convert the following columns to categorical data using factor() - inq.last.6mths - delinq.2yrs - pub.rec - not.fully.paid - credit.policy

loans$credit.policy <- factor(loans$credit.policy)
loans$inq.last.6mths <- factor(loans$inq.last.6mths)
loans$delinq.2yrs <- factor(loans$delinq.2yrs)
loans$pub.rec <- factor(loans$pub.rec)
loans$not.fully.paid <- factor(loans$not.fully.paid)

str(loans)
## 'data.frame':    9578 obs. of  14 variables:
##  $ credit.policy    : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ purpose          : Factor w/ 7 levels "all_other","credit_card",..: 3 2 3 3 2 2 3 1 5 3 ...
##  $ int.rate         : num  0.119 0.107 0.136 0.101 0.143 ...
##  $ installment      : num  829 228 367 162 103 ...
##  $ log.annual.inc   : num  11.4 11.1 10.4 11.4 11.3 ...
##  $ dti              : num  19.5 14.3 11.6 8.1 15 ...
##  $ fico             : int  737 707 682 712 667 727 667 722 682 707 ...
##  $ days.with.cr.line: num  5640 2760 4710 2700 4066 ...
##  $ revol.bal        : int  28854 33623 3511 33667 4740 50807 3839 24220 69909 5630 ...
##  $ revol.util       : num  52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
##  $ inq.last.6mths   : Factor w/ 28 levels "0","1","2","3",..: 1 1 2 2 1 1 1 1 2 2 ...
##  $ delinq.2yrs      : Factor w/ 11 levels "0","1","2","3",..: 1 1 1 1 2 1 1 1 1 1 ...
##  $ pub.rec          : Factor w/ 6 levels "0","1","2","3",..: 1 1 1 1 1 1 2 1 1 1 ...
##  $ not.fully.paid   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 1 1 ...

EDA

Let’s use ggplot2 to visualize the data. We first create a histogram of fico scores, colored by not.fully.paid.

library(ggplot2)

pl <- ggplot(loans,aes(x=fico))
pl <- pl + geom_histogram(aes(fill=not.fully.paid),color='light grey',bins=40,alpha=0.5)
pl + scale_fill_manual(values=c('green','red'))+theme_bw()

Then, we create a barplot of purpose counts, colored by not.fully.paid. Use position=dodge in the geom_bar argument.

pl <- ggplot(loans,aes(x=factor(purpose)))
pl <- pl+geom_bar(aes(fill=not.fully.paid),position="dodge")
pl+theme_bw()+theme(axis.text.x=element_text(angle=90,hjust=1))

We also create a scatterplot of fico score versus int.rate. Does the trend make sense? Play around with the color scheme if you want.

ggplot(loans,aes(x=int.rate,y=fico))+geom_point()+theme_bw()

ggplot(loans,aes(x=int.rate,y=fico))+geom_point(aes(color=not.fully.paid),alpha=0.5)+theme_bw()

Building the Model

Now it’s time to build a model.

Train and Test sets

We first split the data into training and test sets using the caTools library.

library(caTools)
sample <- sample.split(loans$not.fully.paid,SplitRatio=0.8)
loans_train <- subset(loans,sample==T)
loans_test <- subset(loans,sample==F)

We call the e1071 library as shown in the lecture. We now use the svm() function to train a model on the training dataset.

library(e1071)

# svm model application
loans_model <- svm(not.fully.paid~.,data=loans_train)
summary(loans_model)
## 
## Call:
## svm(formula = not.fully.paid ~ ., data = loans_train)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.01724138 
## 
## Number of Support Vectors:  3233
## 
##  ( 2007 1226 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

We can use predict() to predict new values from the test set using the model.

str(loans)
## 'data.frame':    9578 obs. of  14 variables:
##  $ credit.policy    : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ purpose          : Factor w/ 7 levels "all_other","credit_card",..: 3 2 3 3 2 2 3 1 5 3 ...
##  $ int.rate         : num  0.119 0.107 0.136 0.101 0.143 ...
##  $ installment      : num  829 228 367 162 103 ...
##  $ log.annual.inc   : num  11.4 11.1 10.4 11.4 11.3 ...
##  $ dti              : num  19.5 14.3 11.6 8.1 15 ...
##  $ fico             : int  737 707 682 712 667 727 667 722 682 707 ...
##  $ days.with.cr.line: num  5640 2760 4710 2700 4066 ...
##  $ revol.bal        : int  28854 33623 3511 33667 4740 50807 3839 24220 69909 5630 ...
##  $ revol.util       : num  52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
##  $ inq.last.6mths   : Factor w/ 28 levels "0","1","2","3",..: 1 1 2 2 1 1 1 1 2 2 ...
##  $ delinq.2yrs      : Factor w/ 11 levels "0","1","2","3",..: 1 1 1 1 2 1 1 1 1 1 ...
##  $ pub.rec          : Factor w/ 6 levels "0","1","2","3",..: 1 1 1 1 1 1 2 1 1 1 ...
##  $ not.fully.paid   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 1 1 ...
# svm model prediction
pred.values <- predict(loans_model,loans_test[1:13])
table(pred.values,loans_test[,14])
##            
## pred.values    0    1
##           0 8045 1533
##           1    0    0
table(pred.values,loans_test$not.fully.paid)
##            
## pred.values    0    1
##           0 8045 1533
##           1    0    0
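The confusion matrix above can be summarized numerically. As a sketch in base R (the counts are copied from the table printed above), we can compute the overall accuracy, which looks deceptively good because every loan is predicted as class 0:

```r
# Confusion matrix counts taken from the table output above
conf <- matrix(c(8045, 0, 1533, 0), nrow = 2,
               dimnames = list(pred = c("0", "1"), actual = c("0", "1")))

# Accuracy = correct predictions / all predictions
accuracy <- sum(diag(conf)) / sum(conf)
accuracy  # high, yet the model never predicts class 1
```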

Tuning the model: we got poor results, with the model classifying everything into one group. Let's tune the model to fix this issue.

We use the tune() function to test different cost and gamma values. In the lecture, we showed how to do this using train.x and train.y, but it's usually simpler to just pass a formula. Try checking out help(tune) for more details. This is the end of the project, because tuning can take a long time (since it runs many different models).

tune.results <- tune(svm,train.x=not.fully.paid~.,data=loans_train,kernel='radial',ranges=list(cost=c(1,10),gamma=c(0.1,1)))
model <- svm(not.fully.paid~., data=loans_train,cost=10,gamma=0.1)
predicted.values <- predict(model,loans_test[1:13])
table(predicted.values,loans_test$not.fully.paid)
##                 
## predicted.values    0    1
##                0 8010 1133
##                1   35  400
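From the tuned model's confusion matrix we can again compute summary metrics. The base-R sketch below (counts copied from the table above) shows that accuracy improves and, more importantly, the model now identifies some of the class-1 loans:

```r
# Confusion matrix counts from the tuned model's table output above
conf_tuned <- matrix(c(8010, 35, 1133, 400), nrow = 2,
                     dimnames = list(pred = c("0", "1"), actual = c("0", "1")))

accuracy <- sum(diag(conf_tuned)) / sum(conf_tuned)
# Recall for class 1: fraction of not-fully-paid loans actually caught
recall_1 <- conf_tuned["1", "1"] / sum(conf_tuned[, "1"])
c(accuracy = accuracy, recall_1 = recall_1)
```

Note that recall on class 1 is still modest; for an imbalanced target like loan default, metrics beyond raw accuracy matter.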

Section 30: K-Means Clustering

Lecture 116: Introduction

Reference: ISLR Chapter 10

K-Means Clustering is an unsupervised learning algorithm that attempts to group similar observations into clusters. So what does a typical clustering problem look like?

- cluster similar documents
- cluster customers based on features
- market segmentation
- identify similar physical groups

The K-Means Algorithm:

- choose a number of clusters "K"
- randomly assign each point to a cluster
- until the clusters stop changing, repeat the following: (1) for each cluster, compute the cluster centroid by taking the mean vector of the points in the cluster; (2) assign each data point to the cluster whose centroid is closest
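The loop above can be sketched in base R. This is a minimal, illustrative implementation, not the one used by kmeans(); the function name my_kmeans is our own, and it does not handle empty clusters:

```r
my_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Randomly pick k data points as the initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Squared distance from every point to every centroid
    d <- sapply(seq_len(k), function(j)
      rowSums((x - matrix(centroids[j, ], nrow(x), ncol(x), byrow = TRUE))^2))
    # Assign each point to its nearest centroid
    new_assignment <- max.col(-d)
    if (all(new_assignment == assignment)) break  # clusters stopped changing
    assignment <- new_assignment
    # Recompute each centroid as the mean of its assigned points
    for (j in seq_len(k))
      centroids[j, ] <- colMeans(x[assignment == j, , drop = FALSE])
  }
  list(cluster = assignment, centers = centroids)
}

set.seed(101)
res <- my_kmeans(iris[, 1:4], 3)
table(res$cluster)
```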

There is no easy answer for choosing the best K value, but one common approach is the elbow method. - First, compute the sum of squared errors (SSE) for a range of values of k. - The SSE is defined as the sum of the squared distances between each member of a cluster and its centroid.

If we plot k against the SSE, we will see that the error decreases as k gets larger; this is because when the number of clusters increases, they should be smaller, so distortion is also smaller.

The idea of the elbow method is to choose the k at which the SSE stops decreasing abruptly; this produces an "elbow" in the plot of SSE against k.
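The elbow method can be sketched in a few lines of base R using the built-in iris data: compute the total within-cluster SSE for a range of k and plot it.

```r
# Total within-cluster SSE for k = 1..10 on the iris measurements
wss <- sapply(1:10, function(k) {
  set.seed(101)
  kmeans(iris[, 1:4], centers = k, nstart = 20)$tot.withinss
})

# The bend ("elbow") in this curve suggests a reasonable k
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster SSE")
```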

Lecture 117: K-Means Clustering with R

library(ISLR)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
library(ggplot2)
pl <- ggplot(iris,aes(Petal.Length,Petal.Width))+geom_point(aes(color=Species),size=4)
pl

We now run K-means clustering on the iris dataset.

set.seed(101)
irisCluster <- kmeans(iris[,1:4],3,nstart=20)
irisCluster
## K-means clustering with 3 clusters of sizes 62, 50, 38
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.901613    2.748387     4.393548    1.433871
## 2     5.006000    3.428000     1.462000    0.246000
## 3     6.850000    3.073684     5.742105    2.071053
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [71] 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 3 3 3
## [106] 3 1 3 3 3 3 3 3 1 1 3 3 3 3 1 3 1 3 1 3 3 1 1 3 3 3 3 3 1 3 3 3 3 1 3
## [141] 3 3 1 3 3 3 1 3 3 1
## 
## Within cluster sum of squares by cluster:
## [1] 39.82097 15.15100 23.87947
##  (between_SS / total_SS =  88.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

To evaluate the K-means clustering, we can compare the cluster assignments with the true species labels using the table() function. In addition, we can visualize the clusters with the cluster package.

table(irisCluster$cluster,iris$Species)
##    
##     setosa versicolor virginica
##   1      0         48        14
##   2     50          0         0
##   3      0          2        36
# data visualization
library(cluster)
clusplot(iris,irisCluster$cluster,color=T,shade=T,labels=0,
         lines=0)
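One rough way to turn the table above into a single number is to map each cluster to its majority species and count how many points match. This is an informal sketch (not a standard clustering metric such as the adjusted Rand index):

```r
set.seed(101)
irisCluster <- kmeans(iris[, 1:4], 3, nstart = 20)

# Cross-tabulate cluster assignments against true species
tab <- table(irisCluster$cluster, iris$Species)

# Fraction of points falling in their cluster's majority species
agreement <- sum(apply(tab, 1, max)) / sum(tab)
agreement
```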

Lecture 118: K Means Clustering Project